Building a Steam Recommendation Engine Part 1.5 – Users

Part 0 – Intro
Part 1 – Game List

The gentle tingle of a wine-washed Tropicana bottle has encouraged me to finish the script that grabs Steam user data. That, and some unheeded Netflix telling my subconscious that I’m not drinking and coding alone on a Friday night. My regular-conscious, on the other hand, is fucking jazzed to be drinking and coding.

“To build a recommendation engine, one must be handed recommendations on a platter.” – Albert Ghandi Jesus

If you have a public profile, anyone can hit up http://steamcommunity.com/id/{{your steam name}}/games/?tab=all. Behold! All your games in a full, unpaginated request. Too good to be true? No. It’s better. One of the script tags on the page defines a global JS variable called rgGames, a full JSON array of games that you really don’t need to do shit to.
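To make that concrete, here is a minimal sketch of the extraction on a made-up page fragment. The sample HTML, the `rgOther` variable, and the `extract_rg_games` helper are all hypothetical, and the regex assumes the whole array sits on a single line ending in `];`, which may not hold for the real page.

```ruby
require 'json'

# Hypothetical page fragment; the real page has far more markup around it.
SAMPLE_HTML = <<~HTML
  <script>
    var rgGames = [{"appid":620,"name":"Portal 2","hours_forever":"17.2"},{"appid":400,"name":"Portal","hours_forever":"15.3"}];
    var rgOther = [];
  </script>
HTML

# Returns the parsed rgGames array, or an empty array when the variable is absent.
# Assumption: the array literal is on one line and is terminated by "];".
def extract_rg_games(html)
  # "." does not match newlines here, so the greedy match stays on one line.
  raw = html[/var rgGames = (\[.*\]);/, 1]
  raw ? JSON.parse(raw) : []
end

games = extract_rg_games(SAMPLE_HTML)
games.each { |g| puts "#{g['appid']}\t#{g['hours_forever']}\t#{g['name']}" }
```

Matching through the closing `];` instead of cutting at the first semicolon means a game name that happens to contain a `;` would not truncate the JSON.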

More beholding! A script that just pulls this rgGames JSON array and then wastes some cycles on text output formatting because I don’t know what I’m doing with it yet. (Edit: Well, shit. It looks like Steam has a public API for profiles. It requires some extra setup, but I want to be a good steward of their data.)

require 'open-uri'
require 'nokogiri'
require 'cgi'
require 'json'
# UserGameScrape
#
# Usage:
#   ugs = UserGameScrape.new
#   games_tsv_rows = ugs.games( steam_user_name )
#   puts games_tsv_rows
#
class UserGameScrape
  def initialize
    @baseurl = "http://steamcommunity.com/id/"
    @gamepath = "/games/?tab=all"
    @user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.172 Safari/537.22"
    @gamer = false
  end

  # games method accepts valid Steam user with public profile
  # open fails if user is non-existent or invalid (yay no error checking!)
  # returns an array of tab-delimited rows
  def games( gamer )
    @gamer = gamer
    # URI.open is required on Ruby 3+, where Kernel#open no longer fetches URLs
    doc_content = URI.open("#{@baseurl}#{@gamer}#{@gamepath}", "User-Agent" => @user_agent)
    @doc = Nokogiri::HTML( doc_content )
    self.reformat_games( self.get_json )
  end

  # get_json performs the heavy lifting of this class
  # It searches the script tags for the javascript variable "rgGames",
  # which contains a json array of all the user's games.
  # This json string is parsed and returned as an array of hashes
  def get_json
    scripts = @doc.css('head script')
    scripts.each do |script|
      body = script.content
      pos = body.index('var rgGames = ')
      if pos
        pos += 'var rgGames = '.length
        # NOTE: splitting on ';' assumes no game name contains a semicolon
        game_json, *others = body[pos, body.length].split(/;/)
        return JSON.parse(game_json.strip)
      end
    end
    [] # no rgGames variable found on the page
  end

  # reformat_games - loops through json array input and pulls out
  # key data points for reformatted output
  def reformat_games( json )
    formatted = []
    @row = 0
    json.each do |game|
      formatted << self.format_next_row( game['appid'], game['name'], game['hours_forever'] )
    end
    formatted
  end

  # Basically just joins arguments with tabs for later use.
  def format_next_row( id, name, hours )
    @row = @row.next
    # strip anything outside a small whitelist (this also drops apostrophes)
    name = name ? name.gsub(/[^a-zA-Z0-9 :\-\&\.,]/, '') : ''
    "#{@row}\t#{id}\t#{hours}\t#{name}\n"
  end
end
#
# Executing script (main)
#
ugs = UserGameScrape.new
# printing out games to be dumped to file
puts ugs.games( "johnnyfuchs" )

Dump that output to a text file and you get:

1 208480 480.5 Assassins Creed III
2 72850 279.7 The Elder Scrolls V: Skyrim
3 205100 273.1 Dishonored
4 35140 208.2 Batman: Arkham Asylum GOTY Edition
5 217790 71.5 Dogfight 1942
6 8930 36.7 Sid Meiers Civilization V
7 24780 36.6 SimCity 4 Deluxe
8 620 17.2 Portal 2
9 400 15.3 Portal
10 41500 8.8 Torchlight
11 208140 8.5 Endless Space
12 48000 1.7 LIMBO
13 220 1.6 Half-Life 2
14 208600 1.4 Lunar Flight
15 92800 0.5 SpaceChem
16 105600 0.3 Terraria
17 8500 0 Eve Online: Inferno
18 340 0 Half-Life 2: Lost Coast

There you have it, the sad, skewed report of my Steam game playtime. This minuscule and distorted data set isn’t going to do a very good job of suggesting games. I might as well replace all this code with:

10 PRINT "Assassins Creed III"
20 GOTO 10

A good recommendation uses your previous preferences to infer what you will probably like, and you’ll probably like what other people who share your likes like. So we need to find people like you. Fortunately, I found some “volunteers” to start seeding a database with:

Steam Player Community

Thank you vibrant Steam community portal! The following script walks through discussion forum comment threads and pulls out usernames or profile ids.

require 'open-uri'
require 'nokogiri'
require 'cgi'
# UserSeeder class has one important method -> get_user
# all forum and discussion pagination should be automagic
class UserSeeder
  def initialize
    @forum = -1            # forum index; next_forum increments before fetching, so start at -1
    @user_buffer = []      # user buffer holds the list of profile names
    @topic_buffer = []     # holds a list of topics scraped from each forum page
    @forum_page = 0        # page index for a forum
    @forum_pages = 0       # holds the number of pages for the current forum
    @forum_page_size = 50  # max discussions per forum that Steam allows for a request
    @forum_contents = ""   # unparsed forum contents
    # ajax endpoint that returns json plus HTML for page results
    @forum_url = "http://steamcommunity.com/forum/4009259/General/render"
    # html endpoint for first page of discussion forum
    @discussion_url = "http://steamcommunity.com/discussions/forum/"
    # chrome user agent
    @user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.172 Safari/537.22"
  end

  # next_forum increments the forum number, walking through small
  # integers until it finds a forum with a list of valid discussion topics
  # it gives up after 10 iterations by default
  def next_forum( attempts = 10 )
    @forum = @forum.next
    # 0 is truthy in Ruby, so compare explicitly instead of `unless attempts`
    if attempts <= 0
      abort("cannot find a new forum")
    end
    forum_content = URI.open("#{@forum_url}/#{@forum}/?start=0&count=#{@forum_page_size}", "User-Agent" => @user_agent)
    sleep(2)
    @forum_contents = forum_content.read
    @forum_pages = self.get_forum_page_count
    if @forum_pages > 0
      @forum_page = 0
    else
      attempts -= 1
      self.next_forum( attempts )
    end
  end

  # walk through all the "pages" of a discussion forum
  # actually just a start point and a count
  # if it's on the last page it asks for the next forum
  def next_forum_page
    if @forum_page < @forum_pages
      start = @forum_page * @forum_page_size
      forum_content = URI.open("#{@forum_url}/#{@forum}/?start=#{start}&count=#{@forum_page_size}", "User-Agent" => @user_agent)
      sleep(2)
      @forum_contents = forum_content.read
      @forum_page = @forum_page.next
    else
      self.next_forum
    end
  end

  # filling the topic buffer parses the forum page for discussion ids. If
  # it doesn't find any, it walks to the next page in forum and tries again.
  def fill_topic_buffer
    page_topics = @forum_contents.scan(/steamcommunity.com\\\/discussions\\\/forum\\\/[0-9]+\\\/([0-9]+)\\\//).flatten
    if page_topics.length > 0
      @topic_buffer.concat( page_topics )
    else
      self.next_forum_page
      self.fill_topic_buffer
    end
  end

  # get_user pulls a user from the buffer. If the buffer is empty
  # it calls the method to fill it and tries again.
  def get_user
    if @user_buffer.length > 0
      return @user_buffer.shift
    else
      self.fill_user_buffer
      return self.get_user
    end
  end

  # The user buffer is filled by scraping a forum discussion topic
  # for all commenters. If the topic buffer is empty, refill it.
  def fill_user_buffer
    if @topic_buffer.length > 0
      self.users_from_topic( @forum, @topic_buffer.shift )
    else
      self.fill_topic_buffer
      self.users_from_topic( @forum, @topic_buffer.shift )
    end
  end

  # users_from_topic opens a discussion thread and pulls out
  # the commenters' and poster's usernames.
  def users_from_topic( forum, topic )
    # @discussion_url already ends in "/", so don't add another one
    content = URI.open("#{@discussion_url}#{forum}/#{topic}/", "User-Agent" => @user_agent)
    sleep(2)
    doc = Nokogiri::HTML( content )
    doc.css('.commentthread_author_link, .forum_op_author').each do |link|
      @user_buffer << link.attribute('href').content.split('/').last
    end
  end

  # the forum page count is calculated by grabbing the total number
  # of discussions in the forum, and dividing by the "page" size
  # (just the max results per request)
  def get_forum_page_count
    pages = 0
    if @forum_contents.length > 10
      if match = @forum_contents.match(/"total_count"\s?:\s?(null|[0-9]+)/i)
        count = match.captures.first # "null".to_i is 0, so null means no pages
        pages = count.to_i / @forum_page_size.to_f
        pages = pages.ceil
      end
    end
    return pages
  end
end
#
# Executing script (main)
#
di = UserSeeder.new
while true
  puts di.get_user
  $stdout.flush
end

There is a two-second delay between network calls so the script isn’t a complete bitch to Steam’s servers, or flagged as an obvious bot, whatever. After a few minutes we are up to a few thousand profiles. The best part is seeing the same names appear dozens of times in a row. This isn’t YouTube, guys, save the comment spam for your diary. So now we have a fairly lengthy list of usernames. (Edit: I just did a little Linux sort | uniq | wc -l magic and got a whopping 252 unique users out of 30K scraped names… which barely rounds up to a 1% unique rate… REALLY?)
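The same dedupe can happen inside Ruby, so duplicates never hit the output in the first place. This is only a sketch: `unique_users` and `unique_rate` are hypothetical helpers that are not part of the scripts above, and the sample names are made up.

```ruby
require 'set'

# Keeps each name only the first time it is seen, preserving order.
# `names` can be any enumerable of strings, for example the lines of
# a dumped username file.
def unique_users(names)
  seen = Set.new
  # Set#add? returns nil when the value is already present.
  names.map(&:strip).reject(&:empty?).select { |name| seen.add?(name) }
end

# Percentage of scraped names that turned out to be unique.
def unique_rate(unique_count, total_count)
  return 0.0 if total_count.zero?
  (100.0 * unique_count / total_count).round(2)
end

scraped = ["gordon", "alyx", "gordon", "gordon", "barney", "alyx"] # made-up names
uniques = unique_users(scraped)
puts uniques.inspect
puts unique_rate(uniques.length, scraped.length)
puts unique_rate(252, 30_000) # the numbers from the edit above: 0.84
```

For a list that already fits in memory, `Array#uniq` does the same job. The `Set` version is mainly useful if the check gets moved into the seeder loop so repeat names are skipped as they arrive.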

blog_part_1_5 of git://github.com/johnnyfuchs/steamredux.git has the project code in the state of this post. Next post will be starting what the title promises, super fucking cool database stuff.
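Until then, here is a rough sketch of the "people who share your likes" idea from earlier, using Jaccard overlap between game libraries. This is an assumption about one possible approach, not what the project necessarily ends up using. The user names and libraries are made up (the appids are borrowed from my list above), and it only looks at ownership, ignoring playtime entirely.

```ruby
require 'set'

# Made-up libraries: user name => set of appids they own.
LIBRARIES = {
  "me"     => Set[620, 400, 220],
  "alyx"   => Set[620, 400, 220, 105600],
  "barney" => Set[8930, 24780],
}

# Jaccard similarity: shared games divided by all games either user owns.
def jaccard(a, b)
  union = (a | b).size
  union.zero? ? 0.0 : (a & b).size.to_f / union
end

# Scores each game the target does not own by summing the similarity
# of every other user who owns it, then returns appids, best first.
def recommend(target, libraries)
  mine = libraries.fetch(target)
  scores = Hash.new(0.0)
  libraries.each do |user, games|
    next if user == target
    sim = jaccard(mine, games)
    next if sim.zero? # nothing in common, so their library tells us nothing
    (games - mine).each { |appid| scores[appid] += sim }
  end
  scores.sort_by { |appid, score| [-score, appid] }.map(&:first)
end

p recommend("me", LIBRARIES) # => [105600]
```

With 252 real users instead of three fake ones, the interesting part becomes storing and querying those libraries, which is where the database comes in.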
