Databased

I decided to add code to store the results from my scraper into a database instead of just printing them on the screen.  I also added a bit to capture the full review text as well as the title.  In the process, I learned a few things about SQL calls:

  1. You can't use WHERE with INSERT INTO.
  2. If you want to add something to an existing row, you have to use UPDATE...WHERE (see the sketch just after this list).
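
Here's a minimal sketch of the difference, using a throwaway in-memory table (the table name Demo and the sample values are just for illustration, not part of my scraper):

import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE Demo (count INTEGER NOT NULL PRIMARY KEY, title TEXT, review TEXT)')

# INSERT INTO always creates a new row -- there is no WHERE clause to point it at an existing one
cur.execute('INSERT INTO Demo (review) VALUES (?)', ('Loved the instructors.',))

# To fill in a column on a row that already exists, UPDATE ... WHERE picks the row out
cur.execute('UPDATE Demo SET title=? WHERE count=?', ('Great bootcamp', 1))

conn.commit()
conn.close()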

I spent a lot of time trying to insert each review into the same row as its associated title.  At first, I was creating a new row for every review.  Then I tried to link the review with my i counter and insert it into the i-th row.  That went nowhere because (A) the database wasn't associating any numbers with the rows, and (B) i counts review pages, not reviews.  I fumbled around a lot, made my count column INTEGER NOT NULL PRIMARY KEY (probably overkill -- I wanted it to AUTOINCREMENT too, but I guess I was putting in the wrong command), and then, after watching dubiously as the database kept matching negative titles with positive reviews, finally realized I needed to pair each review with the entry count, not the page count.
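
(A side note from reading the SQLite docs afterwards: a column declared INTEGER PRIMARY KEY is already an alias for SQLite's built-in rowid, so it gets increasing integers assigned automatically; AUTOINCREMENT is an optional extra that just stops old ids from ever being reused, and it goes right on the primary-key column.  My guess is I was typing something like the MySQL-style AUTO_INCREMENT, which SQLite doesn't accept.)

cur.execute('DROP TABLE IF EXISTS Reviews')
# INTEGER PRIMARY KEY auto-assigns ids on its own; AUTOINCREMENT only adds the
# guarantee that ids from deleted rows are never handed out again
cur.execute('CREATE TABLE Reviews (count INTEGER PRIMARY KEY AUTOINCREMENT, title TEXT, review TEXT)')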

One problem with the app as it's currently set up is that the association between title and review is happenstance: nothing in the database records, "this is the n-th review stored and this is the n-th title stored, so the n-th review and the n-th title belong together."  The pairing only holds because both lists come back in the same order.
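
If I wanted the association to be explicit rather than accidental, one option (a sketch, not what the script below actually does) would be to pair each title with its review before inserting, so both land in the same row in a single INSERT.  This still assumes the two lists come back from Beautiful Soup in the same order, which is the same assumption the current code leans on:

for title, review in zip(titles, reviews):
    # one INSERT per pair, so a title and its review always share a row
    cur.execute('INSERT INTO Reviews (title, review) VALUES (?, ?)', (title.text, review.text))
conn.commit()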

Tonight I was experimenting a little to see if I could get information about ratings for each review.  Unfortunately, my HTML source gives ratings as a list of elements like <div class="icon-full_star">.  I was able to print out all the text containing those tags and, I think, create a list of the tags I wanted using regex -- but that gave me the total number of stars across all the ratings, which isn't granular enough.  So I found a website I can review later that might help me get better at using Beautiful Soup.
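
If I had to guess at the shape of the fix, it would be to scope the star search to each review's own container instead of the whole page -- something like the sketch below, where the container class name ('review') is a placeholder I'd still have to confirm against the real page source:

for container in soup.find_all('div', attrs={'class': 'review'}):
    # count only the filled-star icons inside this one review
    stars = container.find_all('div', attrs={'class': 'icon-full_star'})
    print(len(stars))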

from urllib.request import urlopen
from bs4 import BeautifulSoup
import sqlite3
import re  # left over from the star-counting experiments; not used below

conn = sqlite3.connect('reviews.sqlite')
cur = conn.cursor()

# Start from a clean table on each run
cur.execute('DROP TABLE IF EXISTS Reviews')
cur.execute('CREATE TABLE Reviews (count INTEGER NOT NULL PRIMARY KEY, title TEXT, review TEXT)')

i = 1  # review-page counter
t = 1  # entry counter: which row the next title belongs in
while True:
    url = "https://www.coursereport.com/schools/coding-dojo?page=" + str(i) + "&sort_by=recent#reviews"
    page = urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')
    titles = soup.find_all('div', attrs={'class': 'review-title'})
    reviews = soup.find_all('div', attrs={'class': 'body'})
    if reviews:
        # each INSERT makes a new row; count is assigned automatically (1, 2, 3, ...)
        for review in reviews:
            cur.execute('INSERT INTO Reviews (review) VALUES (?)', (review.text,))
            conn.commit()
    if titles:
        # titles go into rows that already exist, matched up by the entry counter
        for title in titles:
            cur.execute('UPDATE Reviews SET title=? WHERE count=?', (title.text, t))
            t += 1
            conn.commit()
    else:
        print("No more data.")
        break
    i += 1

Other than that, I'm working through Dr. Chuck's web-spider program, which has one application that pulls pages from the web along with the pages they link to (and so on), then builds a database recording which pages link to which others -- and another application that uses that information to calculate page rank.  I didn't understand much of the code, so I was researching a little about page-ranking algorithms -- in the course of which I found this page, which explains them in a very clear and interesting fashion.

The basic takeaway: a page's rank is a weighted sum, taken over the pages that link to it, of each linking page's rank divided by the number of pages that page links out to.  To get a high rank, you either need lots of pages linking to you, or you need a few very important pages that link to you and to hardly anything else.
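
As a quick way of convincing myself I understood that, here's a bare-bones sketch of the idea in Python (my own toy version, not Dr. Chuck's code; real PageRank also mixes in a damping factor, which I'm ignoring):

def page_rank(links, iterations=20):
    # links maps each page to the list of pages it links out to
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for p, outgoing in links.items():
            if not outgoing:
                continue  # a page with no outgoing links contributes nothing here
            # each page splits its current rank evenly among the pages it links to
            for q in outgoing:
                new_rank[q] += rank[p] / len(outgoing)
        rank = new_rank
    return rank

print(page_rank({'a': ['b', 'c'], 'b': ['c'], 'c': ['a']}))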

I'll be looking at the spider app code and the page-ranking code more closely in the next few days, I hope, and perhaps writing them up here if I have time.  All in all, I didn't get as much done this weekend as I would have liked, but I am excited to have built a basic made-to-order web scraper.  If nothing else, it will make it easier to research code camps.
