I decided to add code to store the results from my scraper into a database instead of just printing them on the screen. I also added a bit to capture the full review text as well as the title. In the process, I learned a few things about SQL calls:
- You can't use WHERE with INSERT INTO
- If you want to add something to an existing row, you have to use UPDATE...WHERE
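A quick sketch of the difference, using an in-memory SQLite database (the table and column names mirror the ones in my scraper, but the values are just stand-ins):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE Reviews (count INTEGER NOT NULL PRIMARY KEY, title TEXT, review TEXT)')

# INSERT always creates a new row -- there is no WHERE clause to target
# an existing one. This adds row 1 with a review but no title yet.
cur.execute('INSERT INTO Reviews (review) VALUES (?)', ('Great course!',))

# To fill in a column on that existing row, UPDATE ... WHERE is what works.
cur.execute('UPDATE Reviews SET title = ? WHERE count = ?', ('My Review', 1))
conn.commit()

print(cur.execute('SELECT count, title, review FROM Reviews').fetchall())
# → [(1, 'My Review', 'Great course!')]
```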
I spent a lot of time trying to get each review into the same row as its associated title. At first I was creating a new row for every review. Then I tried to link the review to my i counter and insert it into the i-th row, but that went nowhere, because (a) the database wasn't associating any numbers with the rows, and (b) i counts review pages, not reviews. I fumbled around a lot, eventually declared my count column INTEGER NOT NULL PRIMARY KEY (probably overkill -- I wanted AUTOINCREMENT too, but I guess I was putting in the wrong command), and then, after watching dubiously as the database kept matching negative titles with positive reviews, finally realized I needed to pair each review with the entry count, not the page count.
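On the AUTOINCREMENT point: in SQLite, a column declared INTEGER PRIMARY KEY becomes an alias for the built-in rowid and is numbered automatically on insert, so the AUTOINCREMENT keyword isn't actually needed for consecutive counts. A minimal check:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE Reviews (count INTEGER NOT NULL PRIMARY KEY, review TEXT)')

# Insert rows without specifying count -- SQLite assigns 1, 2, 3, ...
for text in ('first', 'second', 'third'):
    cur.execute('INSERT INTO Reviews (review) VALUES (?)', (text,))

print(cur.execute('SELECT count FROM Reviews').fetchall())
# → [(1,), (2,), (3,)]
```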
One problem with the app as it's currently set up is that the association between title and review is happenstance: no information is saved telling me, "This is the n-th review stored and this is the n-th title stored, so the n-th review and the n-th title belong together."
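One way to make the pairing explicit instead of happenstance (a sketch, assuming the titles and reviews lists come back from the page in matching order) is to zip them together and insert both into a single row:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE Reviews (count INTEGER NOT NULL PRIMARY KEY, title TEXT, review TEXT)')

# Stand-ins for the .text values scraped from one page.
titles = ['Loved it', 'Not for me']
reviews = ['Great instructors.', 'Too fast-paced.']

# zip() pairs the n-th title with the n-th review, so both land in the
# same row with one INSERT -- no UPDATE ... WHERE bookkeeping needed.
for title, review in zip(titles, reviews):
    cur.execute('INSERT INTO Reviews (title, review) VALUES (?, ?)', (title, review))
conn.commit()

print(cur.execute('SELECT title, review FROM Reviews').fetchall())
# → [('Loved it', 'Great instructors.'), ('Not for me', 'Too fast-paced.')]
```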
Tonight I was experimenting a little to see if I could get rating information for each review. Unfortunately, my HTML source gives ratings as a list of elements like <div class="icon-full_star">. I was able to print out all the text containing those tags and, I think, build a list of the tags I wanted using a regex -- but that gave me the total number of stars across all the ratings, which isn't granular enough. So I found a website I can review later that might help me get better at using Beautiful Soup.
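The granularity problem -- one big total instead of per-review counts -- comes from matching the star tags across the whole page at once. Splitting the page into per-review chunks first and counting stars inside each chunk fixes it. A regex sketch on made-up markup (the icon-full_star class follows the page source described above, but the surrounding structure is invented):

```python
import re

# Invented sample: two reviews, each with its own run of star icons.
html = '''
<div class="review"><div class="icon-full_star"></div><div class="icon-full_star"></div>
<div class="icon-full_star"></div></div>
<div class="review"><div class="icon-full_star"></div><div class="icon-full_star"></div></div>
'''

# Grab each review's chunk of HTML, then count the star icons inside it.
chunks = re.findall(r'<div class="review">.*?</div></div>', html, re.DOTALL)
star_counts = [chunk.count('icon-full_star') for chunk in chunks]
print(star_counts)
# → [3, 2]
```

With Beautiful Soup the same idea would be to call find_all for the stars on each review element rather than on the whole soup, which is probably the cleaner route once I've read up on it.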
from urllib.request import urlopen
from bs4 import BeautifulSoup
import sqlite3

conn = sqlite3.connect('reviews.sqlite')
cur = conn.cursor()
cur.execute('DROP TABLE IF EXISTS Reviews')
cur.execute('CREATE TABLE Reviews (count INTEGER NOT NULL PRIMARY KEY, title TEXT, review TEXT)')

i = 1   # review-page counter
t = 1   # entry counter, matched to the count primary key

while True:
    url = 'https://www.coursereport.com/schools/coding-dojo?page=' + str(i) + '&sort_by=recent#reviews'
    page = urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')
    titles = soup.find_all('div', attrs={'class': 'review-title'})
    reviews = soup.find_all('div', attrs={'class': 'body'})
    if reviews:
        # Each INSERT creates a new row; count is assigned automatically.
        for review in reviews:
            cur.execute('INSERT INTO Reviews (review) VALUES (?)', (review.text,))
        conn.commit()
        # Fill in the title on the row whose count matches the entry counter.
        for title in titles:
            cur.execute('UPDATE Reviews SET title=? WHERE count=?', (title.text, t))
            t += 1
        conn.commit()
    else:
        print("No more data.")
        break
    i += 1
Other than that, I'm working through Dr. Chuck's web-spider program, which has one application that pulls pages, and the pages they link to, from the web (and so on), then builds a database recording which pages link to which others -- and another application that uses that information to calculate page ranking. I didn't understand much of the code, so I was researching a little about page-ranking algorithms -- in the course of which I found this page, which explains them in a very clear and interesting fashion.
The basic takeaway: a page's rank is a weighted sum, taken over the pages that link to it, of each linking page's rank divided by the number of pages that page links out to. To get a high rank, you either need lots of pages linking to you, or a few very important pages linking to you (and to little else).
I'll be looking at the spider app code and the page-ranking code more closely in the next few days, I hope, and perhaps writing them up here if I have time. All in all, I didn't get as much done as I would have liked this weekend, but I am excited to have built a basic made-to-order web scraper. If nothing else, it will make it easier to research code camps.