Databased

I decided to add code to store the results from my scraper into a database instead of just printing them on the screen.  I also added a bit to capture the full review text as well as the title.  In the process, I learned a few things about SQL calls:

  1. You can't use WHERE with INSERT INTO
  2. If you want to add something to an existing row, you have to use UPDATE...WHERE
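
A minimal sketch of the two calls side by side (using an in-memory database and made-up values, just for illustration):

```python
import sqlite3

# In-memory database and made-up values, just for illustration
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE Reviews (count INTEGER NOT NULL PRIMARY KEY, title TEXT, review TEXT)')

# INSERT INTO creates a brand-new row -- no WHERE clause allowed.
# Because count is an INTEGER PRIMARY KEY (an alias for SQLite's rowid),
# it gets filled in automatically: 1, 2, 3, ...
cur.execute("INSERT INTO Reviews (review) VALUES (?)", ("Great course!",))

# To add something to a row that already exists, UPDATE ... WHERE
# picks the row out by its count
cur.execute("UPDATE Reviews SET title=? WHERE count=?", ("Positive review", 1))

print(cur.execute("SELECT count, title, review FROM Reviews").fetchall())
# -> [(1, 'Positive review', 'Great course!')]
```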

I spent a lot of time trying to insert each review into the same row as its associated title.  At first, I was creating a new row for every review.  Then I tried to link the review with my i counter and insert it into the i-th row, but that went nowhere because (a) the database wasn't associating any numbers with the rows and (b) i counts review pages, not reviews.  I fumbled around a lot and got my count column declared INTEGER NOT NULL PRIMARY KEY (probably overkill -- I wanted AUTOINCREMENT too, but I guess I was putting in the wrong command).  Then, after watching dubiously as the database kept matching negative titles with positive reviews, I finally realized I needed to match each review with the entry count, not the page count.

One problem with the app as it's currently set up is that the association between title and review is happenstance -- that is, no information is saved telling me, "This is the n-th review stored and this is the n-th title stored, so the n-th review and the n-th title belong together."
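
One way to fix that would be to insert each title and its review together in a single INSERT, e.g. by zipping the two lists.  This is just a sketch, with made-up lists standing in for the find_all results:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE Reviews (count INTEGER NOT NULL PRIMARY KEY, title TEXT, review TEXT)')

# Hypothetical scraped lists standing in for the soup.find_all results
titles = ["Loved it", "Not for me"]
reviews = ["The instructors were great.", "Too fast-paced."]

# zip pairs the i-th title with the i-th review, so each row carries
# both pieces from the start -- no second UPDATE pass needed
for title, review in zip(titles, reviews):
    cur.execute("INSERT INTO Reviews (title, review) VALUES (?, ?)", (title, review))
conn.commit()
```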

Tonight I was experimenting a little bit to see if I could get information about ratings for each review.  Unfortunately, my HTML source gives ratings as a list of elements like <div class="icon-full_star">.  I was actually able to print out all the text containing those tags and, I think, create a list of the tags I wanted using regex -- but that gave me the total number of stars across all the ratings, which isn't granular enough.  So I found a website I can review later that might help me get better at using Beautiful Soup.
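
In the meantime, here's a sketch of the per-review counting I was after, run against a made-up snippet that mimics the structure I saw (one icon-full_star div per star); the real page's markup may well differ:

```python
import re

# Made-up HTML mimicking the structure described above:
# one <div class="icon-full_star"> per star, inside each review block
html = '''
<div class="review"><div class="icon-full_star"></div><div class="icon-full_star"></div></div>
<div class="review"><div class="icon-full_star"></div><div class="icon-full_star"></div><div class="icon-full_star"></div></div>
'''

# Split the page into per-review chunks FIRST, then count stars inside each --
# counting over the whole page only gives the grand total
blocks = html.split('<div class="review">')[1:]
stars = [len(re.findall(r'icon-full_star', block)) for block in blocks]
print(stars)
# -> [2, 3]
```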

from urllib.request import urlopen
from bs4 import BeautifulSoup
import sqlite3

conn = sqlite3.connect('reviews.sqlite')
cur = conn.cursor()

cur.execute('DROP TABLE IF EXISTS Reviews')
# count is an INTEGER PRIMARY KEY, so SQLite numbers the rows automatically
cur.execute('CREATE TABLE Reviews (count INTEGER NOT NULL PRIMARY KEY, title TEXT, review TEXT)')

i = 1  # page counter
t = 1  # entry counter -- one per review stored, not one per page
while True:
    url = "https://www.coursereport.com/schools/coding-dojo?page=" + str(i) + "&sort_by=recent#reviews"
    page = urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')
    titles = soup.find_all('div', attrs={'class': 'review-title'})
    reviews = soup.find_all('div', attrs={'class': 'body'})
    # INSERT INTO makes a new row per review; count fills in on its own
    for review in reviews:
        cur.execute("INSERT INTO Reviews (review) VALUES (?)", (review.text,))
    conn.commit()
    if titles:
        # UPDATE ... WHERE slots each title into the row of its review,
        # matched by entry count, not page count
        for title in titles:
            cur.execute('UPDATE Reviews SET title=? WHERE count=?', (title.text, t))
            t += 1
        conn.commit()
    else:
        print("No more data.")
        break
    i += 1
conn.close()

Other than that, I'm working through Dr. Chuck's web-spider program: one application pulls pages and the pages they link to from the web (and so on), then builds a database recording which pages link to which other pages -- and another application uses that information to calculate page rank.  I didn't understand much from the code, so I researched page-ranking algorithms a little -- in the course of which I found this page, which explains them in a very clear and interesting fashion.

The basic takeaway: a page's rank is a weighted sum, taken over the pages linking to it, of each linking page's rank divided by the number of pages that page links to.  To get a high rank, you either need lots of pages linking to you or very important pages linking to you exclusively.
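
A toy power-iteration sketch of that idea (the three-page graph and the 0.85 damping factor are made up for illustration, not taken from Dr. Chuck's code):

```python
# Each page maps to the list of pages it links to (made-up toy graph)
links = {
    'A': ['B', 'C'],
    'B': ['C'],
    'C': ['A'],
}
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}  # start everyone equal
d = 0.85  # damping factor, the usual choice

for _ in range(50):
    new_rank = {}
    for p in pages:
        # weighted sum: each linker contributes its own rank divided by
        # the number of pages it links to
        incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
        new_rank[p] = (1 - d) / len(pages) + d * incoming
    rank = new_rank

print({p: round(r, 3) for p, r in rank.items()})
```

Here C ends up ranked highest: both A and B link to it, while B only gets half of A's vote because A splits its links two ways.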

I'll be looking at the spider code and the page-ranking code more closely in the next few days, I hope, and perhaps writing them up here if I have time.  All in all, I didn't get as much done as I would have liked this weekend, but I am excited to have built a basic made-to-order web scraper.  If nothing else, it will make it easier to research code camps.
