Part 0 – Intro
Part 1 – Game List
The gentle tingle of a wine-washed Tropicana bottle has encouraged me to finish the script that grabs Steam user data. That, and some unheeded Netflix telling my subconscious that I’m not drinking and coding alone on a Friday night. My regular-conscious, on the other hand, is fucking jazzed to be drinking and coding.
“To build a recommendation engine, one must be handed recommendations on a platter.” – Albert Ghandi Jesus
If you have a public profile, anyone can hit up http://steamcommunity.com/id/{{your steam name}}/games/?tab=all. Behold! All your games in a full, unpaginated request. Too good to be true? No. It’s better. One of the script tags on the page defines a global JS variable called rgGames, a full JSON array of games that you really don’t need to do shit to.
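Here's a rough offline sketch of that extraction, assuming the page source contains a line of the form var rgGames = [...]; as described above. The helper name extract_rg_games and the sample page are made up for illustration; only the variable name and the appid, name, and hours_forever keys come from this post.

```ruby
require 'json'

# Hypothetical helper: pull the rgGames array out of raw page source.
# Assumes `var rgGames = [...];` ends at the end of a line.
def extract_rg_games(html)
  match = html.match(/var rgGames = (\[.*?\]);\s*$/m)
  return [] unless match
  JSON.parse(match[1])
end

# Made-up stand-in for the real page so the sketch runs without a network call.
SAMPLE_PAGE = <<~HTML
  <html><head><script>
    var rgGames = [{"appid":620,"name":"Portal 2","hours_forever":"17.2"},{"appid":400,"name":"Portal"}];
    var somethingElse = [];
  </script></head></html>
HTML

extract_rg_games(SAMPLE_PAGE).each do |game|
  puts "#{game['appid']}\t#{game['hours_forever'] || 0}\t#{game['name']}"
end
```

Matching through the closing ]; at the end of the line, rather than cutting at the first semicolon, keeps a stray ; inside a game name from truncating the JSON.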
More beholding! A script that just pulls this JSON rgGames object and then wastes some cycles on text output formatting because I don’t know what I’m doing with it yet. (Edit: Well, shit. It looks like Steam has a public API for profiles. It requires some extra setup, but I want to be a good steward of their data.)
require 'open-uri'
require 'nokogiri'
require 'cgi'
require 'json'

# UserGameScrape
#
# Usage:
#   ugs = UserGameScrape.new
#   games_tsv_string = ugs.games( steam_game_name )
#   puts games_tsv_string
#
class UserGameScrape

  def initialize
    @baseurl = "http://steamcommunity.com/id/"
    @gamepath = "/games/?tab=all"
    @user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.172 Safari/537.22"
    @gamer = false
  end

  # games method accepts valid Steam user with public profile
  # open fails if user is non-existent or invalid (yay no error checking!)
  # returns an array of tab formatted rows, one per game
  def games( gamer )
    @gamer = gamer
    # URI.open, because Kernel#open no longer handles URLs as of Ruby 3.0
    doc_content = URI.open("#{@baseurl}#{@gamer}#{@gamepath}", "User-Agent" => @user_agent)
    @doc = Nokogiri::HTML( doc_content )
    self.reformat_games( self.get_json )
  end

  # get_json performs the heavy lifting of this class
  # It searches the script tags for the javascript variable "rgGames",
  # which contains a json array of all the user's games.
  # This json string is parsed and returned as an array of hashes
  def get_json
    @doc.css('head script').each do |script|
      body = script.content
      pos = body.index('var rgGames = ')
      if pos
        pos += 'var rgGames = '.length
        # everything up to the first semicolon is treated as the json
        game_json, *others = body[pos, body.length].split(/;/)
        return JSON.parse(game_json.strip)
      end
    end
    [] # no rgGames variable found on the page
  end

  # reformat_games - loops through json array input and pulls out
  # key data points for reformatted output
  def reformat_games( json )
    formatted = []
    @row = 0
    json.each do |game|
      formatted << self.format_next_row( game['appid'], game['name'], game['hours_forever'] )
    end
    formatted
  end

  # Basically just joins arguments with tabs for later use.
  def format_next_row( id, name, hours )
    @row = @row.next
    name = name ? name.gsub(/[^a-zA-Z0-9 :\-\&\.,]/, '') : ''
    "#{@row.to_s}\t#{id}\t#{hours}\t#{name}\n"
  end

end

#
# Executing script (main)
#
ugs = UserGameScrape.new
# printing out games to be dumped to file
puts ugs.games( "johnnyfuchs" )
Dump that output to a text file and you get:
1   208480  480.5  Assassins Creed III
2   72850   279.7  The Elder Scrolls V: Skyrim
3   205100  273.1  Dishonored
4   35140   208.2  Batman: Arkham Asylum GOTY Edition
5   217790  71.5   Dogfight 1942
6   8930    36.7   Sid Meiers Civilization V
7   24780   36.6   SimCity 4 Deluxe
8   620     17.2   Portal 2
9   400     15.3   Portal
10  41500   8.8    Torchlight
11  208140  8.5    Endless Space
12  48000   1.7    LIMBO
13  220     1.6    Half-Life 2
14  208600  1.4    Lunar Flight
15  92800   0.5    SpaceChem
16  105600  0.3    Terraria
17  8500    0      Eve Online: Inferno
18  340     0      Half-Life 2: Lost Coast
There you have it, the sad, skewed report of my Steam game playtime. This minuscule and distorted data set isn’t going to do a very good job of suggesting games. I might as well replace all this code with:
10 PRINT "Assassins Creed III"
20 GOTO 10
A good recommendation uses your previous preferences to infer what you will probably like, and you’ll probably like what other people who share your likes like. So we need to find people like you. Fortunately, I found some “volunteers” to start seeding a database with.
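The "people who share your likes" idea can be sketched in a few lines of Ruby. This is one common way to do it (Jaccard overlap between game libraries, then scoring unowned games by how similar their owners are to you), not necessarily what this project ends up using. The method names and the tiny LIBRARIES hash are invented for illustration.

```ruby
require 'set'

# Jaccard similarity: shared games divided by all games either user owns.
def jaccard(a, b)
  union = (a | b).size
  union.zero? ? 0.0 : (a & b).size.to_f / union
end

# Score every game the target user lacks by summing the similarity of each
# other user who owns it, then return game names, best first.
def recommend(target, libraries, limit = 3)
  mine = libraries.fetch(target)
  scores = Hash.new(0.0)
  libraries.each do |user, games|
    next if user == target
    sim = jaccard(mine, games)
    next if sim.zero?
    (games - mine).each { |game| scores[game] += sim }
  end
  scores.sort_by { |game, score| [-score, game] }.first(limit).map(&:first)
end

# Made-up libraries, purely illustrative.
LIBRARIES = {
  'me'    => Set['Portal', 'Portal 2', 'Half-Life 2'],
  'alice' => Set['Portal', 'Portal 2', 'Torchlight'],
  'bob'   => Set['Portal', 'LIMBO', 'Torchlight'],
  'carol' => Set['SpaceChem']
}.freeze

puts recommend('me', LIBRARIES).inspect
```

With one library there is nobody to compare against, which is why the next step is collecting more users.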
Thank you vibrant Steam community portal! The following script walks through discussion forum comment threads and pulls out usernames or profile ids.
require 'open-uri'
require 'nokogiri'
require 'cgi'

# UserSeeder class has one important method -> get_user
# all forum and discussion pagination should be automagic
class UserSeeder

  def initialize
    @forum = -1            # next_forum increments before fetching, so start at -1
    @user_buffer = []      # user buffer holds the list of profile names
    @topic_buffer = []     # holds a list of topics scraped from each forum page
    @forum_page = 0        # page index for a forum
    @forum_pages = 0       # holds the number of pages for the current forum
    @forum_page_size = 50  # max discussions per forum that Steam allows for a request
    @forum_contents = ""   # unparsed forum contents
    # ajax endpoint that returns json plus HTML for page results
    @forum_url = "http://steamcommunity.com/forum/4009259/General/render"
    # html endpoint for first page of discussion forum
    @discussion_url = "http://steamcommunity.com/discussions/forum/"
    # chrome user agent
    @user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.172 Safari/537.22"
  end

  # next_forum walks through small integer forum numbers
  # until it finds one with a list of valid discussion topics
  # it gives up after 10 attempts by default
  def next_forum( attempts = 10 )
    @forum = @forum.next
    # 0 is truthy in Ruby, so compare explicitly rather than `unless attempts`
    abort("cannot find a new forum") if attempts <= 0
    # URI.open, because Kernel#open no longer handles URLs as of Ruby 3.0
    forum_content = URI.open("#{@forum_url}/#{@forum}/?start=0&count=#{@forum_page_size}", "User-Agent" => @user_agent)
    sleep(2)
    @forum_contents = forum_content.read
    @forum_pages = self.get_forum_page_count
    if @forum_pages > 0
      @forum_page = 0
    else
      self.next_forum( attempts - 1 )
    end
  end

  # walk through all the "pages" of a discussion forum
  # actually just a start point and a count
  # if it's on the last page it will ask for the next forum
  def next_forum_page
    if @forum_page < @forum_pages
      start = @forum_page * @forum_page_size
      forum_content = URI.open("#{@forum_url}/#{@forum}/?start=#{start}&count=#{@forum_page_size}", "User-Agent" => @user_agent)
      sleep(2)
      @forum_contents = forum_content.read
      @forum_page = @forum_page.next
    else
      self.next_forum
    end
  end

  # filling the topic buffer parses the forum page for discussion ids. If
  # it doesn't find any, it walks to the next page in forum and tries again.
  def fill_topic_buffer
    page_topics = @forum_contents.scan(/steamcommunity.com\\\/discussions\\\/forum\\\/[0-9]+\\\/([0-9]+)\\\//).flatten
    if page_topics.length > 0
      @topic_buffer.concat( page_topics )
    else
      self.next_forum_page
      self.fill_topic_buffer
    end
  end

  # get_user pulls a user from the buffer. If the buffer is empty
  # it calls the method to fill it and tries again.
  def get_user
    if @user_buffer.length > 0
      return @user_buffer.shift
    else
      self.fill_user_buffer
      return self.get_user
    end
  end

  # The user buffer is filled by scraping a forum discussion topic
  # for all commenters. If the topic buffer is empty, refill it.
  def fill_user_buffer
    if @topic_buffer.length > 0
      self.users_from_topic( @forum, @topic_buffer.shift )
    else
      self.fill_topic_buffer
      self.users_from_topic( @forum, @topic_buffer.shift )
    end
  end

  # users_from_topic opens a discussion thread and pulls out
  # the commenters' and poster's usernames.
  def users_from_topic( forum, topic )
    # @discussion_url already ends with a slash
    content = URI.open("#{@discussion_url}#{forum}/#{topic}/", "User-Agent" => @user_agent)
    sleep(2)
    doc = Nokogiri::HTML( content )
    doc.css('.commentthread_author_link, .forum_op_author').each do |link|
      @user_buffer << link.attribute('href').content.split('/').last
    end
  end

  # the forum page count is calculated by grabbing the total number
  # of discussions in the forum, and dividing by the "page" size
  # (just the max results per request)
  def get_forum_page_count
    pages = 0
    if @forum_contents.length > 10
      if match = @forum_contents.match(/"total_count"\s?:\s?(null|[0-9]+)/i)
        count = match[1] # "null".to_i is 0, which gives zero pages
        pages = (count.to_i / @forum_page_size.to_f).ceil
      end
    end
    pages
  end

end

#
# Executing script (main)
#
di = UserSeeder.new
while true
  puts di.get_user
  $stdout.flush
end
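The script refills each buffer by having methods call themselves until something turns up. As a hedged alternative, the same pull-one-user-at-a-time behaviour can be sketched with Ruby's Enumerator, which only asks for the next page when the current one runs dry. Everything here is hypothetical: each_user, the fetch_page callable, and the FAKE_PAGES data stand in for the real network calls.

```ruby
# Made-up stand-in for the forum: each "page" maps topic ids to commenters.
FAKE_PAGES = [
  { 't1' => %w[alice bob alice], 't2' => [] },
  { 't3' => %w[carol] }
].freeze

# Lazily yields one username at a time. fetch_page is a hypothetical callable
# that takes a page index and returns a hash of topic => usernames, or nil
# when there are no more pages.
def each_user(fetch_page)
  Enumerator.new do |out|
    page = 0
    while (topics = fetch_page.call(page))
      topics.each_value { |users| users.each { |user| out << user } }
      page += 1
    end
  end
end

users = each_user(->(page) { FAKE_PAGES[page] })
puts users.first(3).inspect
```

Because the enumerator is lazy, asking for the first three users never touches the second page, and empty topics are skipped with a loop instead of a deeper call stack.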
There is a two-second delay between network calls so as not to be a complete bitch to Steam’s servers, or get flagged as an obvious bot, whatever. After a few minutes we are up to a few thousand profiles. The best part is seeing the same names appear dozens of times in a row. This isn’t YouTube, guys, save the comment spam for your diary. So now we have a fairly lengthy list of usernames. (Edit: I just did a little Linux sort | uniq | wc -l magic and got a whopping 252 unique users out of 30K scraped names… which barely rounds up to a 1% unique rate… REALLY?)
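For the record, the same count can be done in Ruby, and the repeats can be dropped as names stream in rather than after the fact. A small sketch, with unique_stats and each_new_name as made-up helper names and a toy list standing in for the scraped output:

```ruby
require 'set'

# Returns [unique_count, total_count, percent_unique] for a list of names.
def unique_stats(names)
  total = names.length
  unique = names.uniq.length
  percent = total.zero? ? 0.0 : (100.0 * unique / total).round(2)
  [unique, total, percent]
end

# Yields each name the first time it is seen and silently skips repeats.
# Set#add? returns nil when the element is already present.
def each_new_name(names)
  seen = Set.new
  names.each { |name| yield name if seen.add?(name) }
end

# Toy input, not real scraped data.
scraped = %w[spammer spammer lurker spammer poster lurker]
puts unique_stats(scraped).inspect
each_new_name(scraped) { |name| puts name }
```

Taking 30K as exactly 30,000, 252 unique names works out to 0.84%.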
blog_part_1_5 of git://github.com/johnnyfuchs/steamredux.git has the project code in the state of this post. Next post will start on what the title promises: super fucking cool database stuff.