Building a Steam Recommendation Engine Part 1.5 – Users

Part 0 – Intro
Part 1 – Game List

The gentle tingle of a wine-washed Tropicana bottle has encouraged me to finish the script that grabs Steam user data. That, and some unheeded Netflix telling my subconscious that I’m not drinking and coding alone on a Friday night. My regular-conscious, on the other hand, is fucking jazzed to be drinking and coding.

“To build a recommendation engine, one must be handed recommendations on a platter.” – Albert Ghandi Jesus

If you have a public profile, anyone can hit up http://steamcommunity.com/id/{{your steam name}}/games/?tab=all. Behold! All your games in a full, unpaginated request. Too good to be true? No. It’s better. One of the script tags on the page defines a global JS variable called rgGames, a full JSON array of games that you really don’t need to do shit to.
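Here is a minimal sketch of just the extraction step, with no network call. The sample page text and the extract_rg_games helper are made up for illustration; the real rgGames array carries more fields than the three shown.

```ruby
require 'json'

# Made-up stand-in for the real page source (assumption: same variable name,
# same three fields that the scraper below reads).
SAMPLE_PAGE = <<~HTML
  <script>
  var rgGames = [{"appid":620,"name":"Portal 2","hours_forever":"17.2"},{"appid":400,"name":"Portal","hours_forever":"15.3"}];
  var somethingElse = 1;
  </script>
HTML

# Hypothetical helper: returns the parsed rgGames array,
# or an empty array if the variable is not on the page.
def extract_rg_games(page_source)
  match = page_source.match(/var rgGames = (\[.*?\]);/m)
  match ? JSON.parse(match[1]) : []
end

games = extract_rg_games(SAMPLE_PAGE)
games.each { |g| puts "#{g['appid']}\t#{g['hours_forever']}\t#{g['name']}" }
```

The regex stops at the first `];`, so a game name containing that exact sequence would trip it up. Good enough for a sketch, probably not for production.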

More beholding! A script that just pulls this JSON rgGames object and then wastes some cycles for text output formatting because I don’t know what I’m doing with it yet. (Edit: Well, shit. It looks like Steam has a public API for profiles. It requires some extra setup, but I want to be a good steward of their data.)

require 'open-uri'
require 'nokogiri'
require 'cgi'
require 'json'
# UserGameScrape
#
# Usage:
# ugs = UserGameScrape.new
# games_tsv_string = ugs.games( steam_game_name )
# puts games_tsv_string
#
class UserGameScrape
def initialize
@baseurl = "http://steamcommunity.com/id/"
@gamepath = "/games/?tab=all"
@user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.172 Safari/537.22"
@gamer = false
end
# games method accepts valid Steam user with public profile
# open fails if user is non-existent or invalid (yay no error checking!)
# returns a tab formatted list of results
def games( gamer )
@gamer = gamer
# URI.open (from open-uri) because plain Kernel#open no longer fetches URLs on Ruby 3+
doc_content = URI.open("#{@baseurl}#{@gamer}#{@gamepath}", "User-Agent" => @user_agent)
@doc = Nokogiri::HTML( doc_content )
results = self.reformat_games( self.get_json )
results
end
# get_json performs the heavy lifting of this class
# It searches the script tags for the javascript variable "rgGames",
# which contains a json object of all the users' games.
# This json string is parsed and reformatted for output to text file
def get_json
json = @doc.css('head script')
json.each do |script|
body = script.content
pos = body.index('var rgGames = ')
if pos
pos += 'var rgGames = '.length
game_json, *others = body[pos, body.length].split(/;/)
return JSON.parse(game_json.strip)
end
end
# no rgGames variable found (e.g. a private profile): return no games
[]
end
# reformat_games - loops through json array input and pulls out
# key data points for reformatted output
def reformat_games( json )
formatted = []
@row = 0
json.each do |game|
formatted << self.format_next_row( game['appid'], game['name'], game['hours_forever'] )
end
formatted
end
# Basically just joins arguments with tabs for later use.
def format_next_row( id, name, hours )
@row = @row.next
name = name ? name.gsub(/[^a-zA-Z0-9 :\-\&\.,]/, '') : ''
"#{@row.to_s}\t#{id}\t#{hours}\t#{name}\n"
end
end
#
# Executing script (main)
#
ugs = UserGameScrape.new
# printing out games to be dumped to file
puts ugs.games( "johnnyfuchs" )

Dump that output to a text file and you get:

1 208480 480.5 Assassins Creed III
2 72850 279.7 The Elder Scrolls V: Skyrim
3 205100 273.1 Dishonored
4 35140 208.2 Batman: Arkham Asylum GOTY Edition
5 217790 71.5 Dogfight 1942
6 8930 36.7 Sid Meiers Civilization V
7 24780 36.6 SimCity 4 Deluxe
8 620 17.2 Portal 2
9 400 15.3 Portal
10 41500 8.8 Torchlight
11 208140 8.5 Endless Space
12 48000 1.7 LIMBO
13 220 1.6 Half-Life 2
14 208600 1.4 Lunar Flight
15 92800 0.5 SpaceChem
16 105600 0.3 Terraria
17 8500 0 Eve Online: Inferno
18 340 0 Half-Life 2: Lost Coast

There you have it, the sad, skewed report of my Steam game playtime. This minuscule and distorted data set isn’t going to do a very good job suggesting games. I might as well replace all this code with:

10 PRINT "Assassins Creed III"
20 GOTO 10

A good recommendation uses your previous preferences to infer what you will probably like, and you’ll probably like what other people who share your likes like. So we need to find people like you. Fortunately, I found some “volunteers” to start seeding a database with:

Steam Player Community
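To make the "people who share your likes" idea concrete, here is a rough sketch under loud assumptions: the LIBRARIES data is invented, and Jaccard overlap of owned games is just one simple similarity measure I picked for illustration, not what this project ends up using.

```ruby
# Hypothetical playtime data: user => { game name => hours }.
LIBRARIES = {
  'me'    => { 'Portal' => 15.3, 'Portal 2' => 17.2, 'Dishonored' => 273.1 },
  'alice' => { 'Portal' => 4.0, 'Portal 2' => 9.5, 'Half-Life 2' => 30.0 },
  'bob'   => { 'Torchlight' => 50.0, 'Terraria' => 120.0 }
}

# Jaccard similarity: shared games divided by all games either user owns.
def similarity(a, b)
  shared = (a.keys & b.keys).size
  total  = (a.keys | b.keys).size
  total.zero? ? 0.0 : shared.to_f / total
end

# Score each game the user lacks by summing the similarity of its owners,
# then return game names from best to worst score.
def recommend(user, libraries)
  mine = libraries.fetch(user)
  scores = Hash.new(0.0)
  libraries.each do |other, games|
    next if other == user
    weight = similarity(mine, games)
    next if weight.zero?
    (games.keys - mine.keys).each { |game| scores[game] += weight }
  end
  scores.sort_by { |_game, score| -score }.map(&:first)
end

puts recommend('me', LIBRARIES).inspect
```

This ignores the hours entirely, which is exactly the kind of thing a real engine would want to weigh in.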

Thank you vibrant Steam community portal! The following script walks through discussion forum comment threads and pulls out usernames or profile ids.

require 'open-uri'
require 'nokogiri'
require 'cgi'
# UserSeeder class has one important method -> get_user
# all forum and discussion pagination should be automagic
class UserSeeder
def initialize
@forum = -1 # next_forum increments this before fetching, so start at -1
@user_buffer = [] # user buffer holds the list of profile names
@topic_buffer = [] # holds a list of topics scraped from each forum page
@forum_page = 0 # page index for a forum
@forum_pages = 0 # holds the number of pages for the current forum
@forum_page_size = 50 # max discussions per forum that Steam allows for a request
@forum_contents = "" # unparsed forum contents
# ajax endpoint that returns json plus HTML for page results
@forum_url = "http://steamcommunity.com/forum/4009259/General/render"
# html endpoint for first page of discussion forum
@discussion_url = "http://steamcommunity.com/discussions/forum/"
# chrome user agent
@user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.172 Safari/537.22"
end
# next_forum increments the forum number and walks through small
# integer forum ids until it finds one with a list of valid discussion topics
# it gives up after 10 attempts by default
def next_forum( attempts = 10 )
@forum = @forum.next
# 0 is truthy in Ruby, so compare explicitly instead of `unless attempts`
if attempts <= 0
abort("cannot find a new forum")
end
forum_content = URI.open("#{@forum_url}/#{@forum}/?start=0&count=#{@forum_page_size}", "User-Agent" => @user_agent)
sleep(2)
@forum_contents = forum_content.read
@forum_pages = self.get_forum_page_count
if @forum_pages > 0
@forum_page = 0
else
attempts -= 1
self.next_forum( attempts )
end
end
# walk through all the "pages" of a discussion forum
# actually just a start point and a count
# if it's on the last page it will ask for the next forum
def next_forum_page
if @forum_page < @forum_pages
start = @forum_page * @forum_page_size
forum_content = URI.open("#{@forum_url}/#{@forum}/?start=#{start}&count=#{@forum_page_size}", "User-Agent" => @user_agent)
sleep(2)
@forum_contents = forum_content.read
@forum_page = @forum_page.next
else
self.next_forum
end
end
# filling the topic buffer parses the forum page for discussion ids. If
# it doesn't find any, it walks to the next page in forum and tries again.
def fill_topic_buffer
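page_topics = @forum_contents.scan(/steamcommunity.com\\\/discussions\\\/forum\\\/[0-9]+\\\/([0-9]+)\\\//).flatten
# clear the parsed page so the next refill fetches a new page instead of re-scanning this one
@forum_contents = ""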
if page_topics.length > 0
@topic_buffer.concat( page_topics )
else
self.next_forum_page
self.fill_topic_buffer
end
end
# get_user pulls a user from the buffer. If the buffer is empty
# it calls the method to fill it and tries again.
def get_user
if @user_buffer.length > 0
return @user_buffer.shift
else
self.fill_user_buffer
return self.get_user
end
end
# The user buffer is filled by scraping a forum discussion topic
# for all commentors. If the topic buffer is empty, refill it.
def fill_user_buffer
if @topic_buffer.length > 0
self.users_from_topic( @forum, @topic_buffer.shift )
else
self.fill_topic_buffer
self.users_from_topic( @forum, @topic_buffer.shift )
end
end
# users_from_topic opens a discussion thread and pulls out
# commentor and poster's usernames.
def users_from_topic( forum, topic )
content = URI.open("#{@discussion_url}/#{forum}/#{topic}/", "User-Agent" => @user_agent)
sleep(2)
doc = Nokogiri::HTML( content )
doc.css('.commentthread_author_link, .forum_op_author').each do |link|
@user_buffer << link.attribute('href').content.split('/').last
end
end
# the forum page count is calculated by grabbing the total number
# of discussions in the forum, and dividing by the "page" size
# (just the max results per request)
def get_forum_page_count
pages = 0
if @forum_contents.length > 10
if match = @forum_contents.match(/"total_count"\s?:\s?(null|[0-9]+)/i)
count, null = match.captures
pages = count.to_i / @forum_page_size.to_f
pages = pages.ceil
end
end
return pages
end
end
#
# Executing script (main)
#
di = UserSeeder.new
while true
puts di.get_user
$stdout.flush
end

There is a delay of a couple of seconds between network calls so I’m not a complete bitch to Steam’s servers, or flagged as an obvious bot, whatever. After a few minutes we are up to a few thousand profiles. The best part is seeing the same names appear dozens of times in a row. This isn’t YouTube guys, save the comment spam for your diary. So now we have a fairly lengthy list of usernames. ( Edit: I just did a little Linux sort | uniq | wc -l magic and got a whopping 252 unique users out of 30K scraped names… which barely rounds up to 1% unique rate… REALLY? )
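For anyone without a shell handy, the same sort | uniq | wc -l count can be sketched in Ruby. The unique_rate helper and the sample names are hypothetical; in practice the lines would come from the dumped output file.

```ruby
# Hypothetical helper: counts total and unique names in a list of lines
# and reports the unique rate as a percentage.
def unique_rate(lines)
  names  = lines.map(&:strip).reject(&:empty?)
  unique = names.uniq.size
  rate   = names.empty? ? 0.0 : (unique.to_f / names.size * 100).round(2)
  { total: names.size, unique: unique, rate: rate }
end

# Made-up sample standing in for the scraped list.
sample = ["gaben\n", "spammer42\n", "spammer42\n", "spammer42\n"]
puts unique_rate(sample).inspect

# The numbers from the post: 252 unique out of roughly 30,000 names.
puts (252.0 / 30_000 * 100).round(2)
```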

blog_part_1_5 of git://github.com/johnnyfuchs/steamredux.git has the project code in the state of this post. Next post will be starting what the title promises, super fucking cool database stuff.


Building a Steam Recommendation Engine Part 1 – Steam Game List

Part 0 – Intro

On a semi-related note, I registered the domain “steamredux.com” for this project. Apparently, redux is Latin for “revived from the dead”. That means SteamRedux makes all of zero sense. Aaaaaaand there goes ten bucks that could have bought me a six pack of something 6.7%.

Googling for a list of Steam games got me nowhere. I landed on the Steam search page which contained a promising little footer showing 1 - 25 of 6691, indicating that I could just walk through the pages to find every Steam game available. This revelation was “the hard part” of creating a recommendation engine, probably.
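As a quick sanity check on how many pages that is, here is a small sketch that reads a footer like the one above and works out the page count. The page_count helper and the exact footer wording are assumptions for illustration.

```ruby
# Hypothetical helper: parses a footer such as "showing 1 - 25 of 6691"
# and returns how many pages it takes to cover every result.
def page_count(footer)
  match = footer.match(/(\d+)\s*-\s*(\d+)\s+of\s+(\d+)/)
  return 0 unless match
  first, last, total = match.captures.map(&:to_i)
  per_page = last - first + 1
  (total.to_f / per_page).ceil
end

puts page_count("showing 1 - 25 of 6691")
```

At 25 results per page, 6691 results comes to 268 requests, which at five seconds apiece is a little over twenty minutes of polite scraping.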

My day job is writing JavaScript with jQuery for DOM manipulation, and my GitHub account has a couple ignored Node.js projects. This technology stack is perfect for screen scraping. Instead, the code here uses Ruby and the Nokogiri gem. I don’t know Ruby at all (or Latin it seems). Ruby has a bunch of pimp features that I’m not using. Please gloss over the completely synchronous network calls and unclear code; it’s not a reflection of the language, just my recovery Sunday laziness. Even my dog is exhausted from a big night chewing bones. (Something something your mom)

Pup Taking a Nap

Okay, brass tacks. The Steam search page http://store.steampowered.com/search/ ajaxes in full blocks of HTML from http://store.steampowered.com/search/results with a “page” GET parameter. In that page is an empty div full of blocks that look like this:

<a class="search_result_row even" href="http://store.steampowered.com/app/211160/?snr=1_7_7_230_150_24">
<div class="col search_price"> &#36;14.99 </div>
<div class="col search_type">
<img src="http://cdn2.store.steampowered.com/public/images/ico/ico_type_app.gif">
</div>
<div class="col search_metascore"></div>
<div class="col search_released"> Oct 17, 2012 </div>
<div class="col search_capsule">
<img src="http://cdn2.steampowered.com/v/gfx/apps/211160/capsule_sm_120.jpg?t=1350669009" alt="Buy Viking: Battle for Asgard" width="120" height="45">
</div>
<div class="col search_name ellipsis">
<h4>Viking: Battle for Asgard</h4>
<p>
<img class="platform_img" src="http://cdn2.store.steampowered.com/public/images/v5/platforms/platform_win.png" width="22" height="22">
Action, Adventure - Released: Oct 17, 2012
</p>
</div>
<div style="clear: both;"></div>
</a>

The highly sophisticated code below is what fetches all the games, and parses out the content from the above block of HTML a few thousand times. Thank you computer. Comments say what each part does because making 8 Gists sounds worse than a hangover.

require 'open-uri'
require 'nokogiri'
require 'cgi'
class SteamGameStrape
def initialize
@page = 0 # next_page increments before fetching, so the first request is page 1
@url = "http://store.steampowered.com/search/results"
@user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.172 Safari/537.22"
@row = 0
@doc = nil
@ele = nil
end
# Increment the requested page number.
# Save network result to a Nokogiri HTML document.
# Return formatted results (game id, name, type, etc)
def next_page
@page = @page.next
doc_content = URI.open("#{@url}?page=#{@page}", "User-Agent" => @user_agent)
@doc = Nokogiri::HTML( doc_content )
page_results = self.parse_page
page_results
end
# Finds all the blocks that match .search_result_row using a css selector.
# Set a row and element reference to be used by parsing functions.
# Return array of parsed results.
def parse_page
rows = @doc.css('.search_result_row')
results = []
rows.each do |ele|
@row = @row.next
@ele = ele
id = self.game_id
if id
results << self.format_next_row( id, self.game_price, self.game_name, self.game_type )
end
end
results
end
# Basically just joins arguments with tabs for later use.
def format_next_row( id, price, name, type)
"#{@row.to_s}\t#{id}\t#{type}\t#{price}\t#{name}"
end
# The "game_type" is really "content_type" as search results also include guides, mods, etc.
# The only place that has this information is the icon for the game type,
# fortunately it's semantically named.
def game_type
img_ele = @ele.css('.search_type img').first
src = img_ele && img_ele['src']
match = src && src.match(/(app|dlc|vid|mod|guide)/)
match ? match[1] : '???'
end
# The game id is Steam's external (maybe internally too?).
# This is pulled from the permalink to the game page.
def game_id
app = @ele.attribute('href').content.scan(/app\/[0-9]+/)
id = app.first ? app.first.gsub(/app\//,'') : nil
id
end
# Game name is scraped from the h4 tag.
# Steam's employees are totally using XP or something, cuz ISO characters are everywhere.
def game_name
raw_name = @ele.css('.search_name h4').first
# Killing special characters, Encoding::ISO_8859_1 was a pain to convert to UTF-8 versions.
raw_name ? raw_name.content.gsub(/[\s]+/, " ").gsub(/[^a-zA-Z0-9 :\-\&\.,]/, '') : ''
end
# Parse out the game price... for some reason.
def game_price
raw_price = @ele.css('.search_price').first.content.split('$').last
raw_price ? raw_price : 'Free to Play'
end
end
#
# Executing script (main)
#
ss = SteamGameStrape.new
# Keep fetching pages until one comes back empty (an empty array is truthy in Ruby).
# This needs _much_ more robust handling.
until (res = ss.next_page).empty?
res.each do |row|
puts row
end
# Stall execution for five seconds to be nice to Steam servers.
sleep( 5 )
end

If you are only here to rip off a list of games, open this up in Excel. Otherwise clone it like you know what you’re doing.

git://github.com/johnnyfuchs/steamredux.git

Branch “blog_part_1” will have the code for this post. Next week I might go down the rabbit hole of putting these in a Neo4j database. I also might be setting up the script to scrape user data. Whichever I’m least likely to not do.


Building a Steam Recommendation Engine Part 0

I thought it’d be fun to make a Steam game rating site. But apparently (and expectedly) that exists. Instead, I plan to figure out how graph databases work. This was not inspired by Facebook’s graph search, because I can’t possibly see how my “friends” restaurant recommendations could be more valuable than, you know, an aggregate of everyone’s or experts’. Nor was it inspired by the amazing ability to see which of my friends of friends like dogs or cheese or both. Spoiler: all of them.

A little more inspiring is the use of graph databases in studying gene interaction, illness diagnosis, path finding, and to the point, a video game recommendation engine.

Mostly to keep me focused, I’ll be chronicling the progress of putting this loosely planned project together. We (royal ‘we’, not author and readers ‘we’) are going to need a few things to make this happen.

  1. A list of Steam games. Just scraping these.
  2. Some Steam users and the lists of the games that they play or like. I dunno yet. Maybe I can scrape these too.
  3. A graph database! I’m going to use Neo4j because I heard a talk on it and Heroku has a free addon. Plus, I trust companies of Swedish origin. Right? IKEA?
  4. A little web site to, you know, recommend stuff to people. I’ll do that last, or part way.

Four things? Nice project management, we. Next post I’ll cover part one, because I already wrote that.
