Building a Steam Recommendation Engine Part 1 – Steam Game List

Part 0 – Intro

On a semi-related note, I registered the domain “steamredux.com” for this project. Apparently, redux is Latin for “revived from the dead”. That means SteamRedux makes all of zero sense. Aaaaaaand there goes ten bucks that could have bought me a six pack of something 6.7%.

Googling for a list of Steam games got me nowhere. I landed on the Steam search page, which contained a promising little footer showing “1 - 25 of 6691”, indicating that I could just walk through the pages to find every Steam game available. This revelation was “the hard part” of creating a recommendation engine, probably.
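
For a rough sense of how many requests that footer implies, here is a quick sketch. It assumes the footer always reads "first - last of total" and that every page holds the same number of rows; `page_count` is a helper name made up for this example.

```ruby
# Hypothetical helper: work out how many result pages a footer like
# "1 - 25 of 6691" implies. Assumes every page holds the same number of rows.
def page_count(footer)
  first, last, total = footer.scan(/\d+/).map(&:to_i)
  per_page = last - first + 1
  (total.to_f / per_page).ceil
end

puts page_count("1 - 25 of 6691")
```

With the numbers above that works out to 268 pages.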

My day job is writing JavaScript with jQuery for DOM manipulation, and my GitHub account has a couple of ignored Node.js projects. This technology stack is perfect for screen scraping. Instead, the code here uses Ruby and the Nokogiri gem. I don’t know Ruby at all (or Latin, it seems). Ruby has a bunch of pimp features that I’m not using. Please gloss over the completely synchronous network calls and unclear code; it’s not a reflection of the language, just my recovery Sunday laziness. Even my dog is exhausted from a big night chewing bones. (Something something your mom)

Pup Taking a Nap

Okay, brass tacks. The Steam search page http://store.steampowered.com/search/ ajaxes in full blocks of HTML from http://store.steampowered.com/search/results with a “page” GET parameter. The page has an otherwise empty div that gets filled with blocks that look like this:

<a class="search_result_row even" href="http://store.steampowered.com/app/211160/?snr=1_7_7_230_150_24">
  <div class="col search_price"> &#36;14.99 </div>
  <div class="col search_type">
    <img src="http://cdn2.store.steampowered.com/public/images/ico/ico_type_app.gif">
  </div>
  <div class="col search_metascore"></div>
  <div class="col search_released"> Oct 17, 2012 </div>
  <div class="col search_capsule">
    <img src="http://cdn2.steampowered.com/v/gfx/apps/211160/capsule_sm_120.jpg?t=1350669009" alt="Buy Viking: Battle for Asgard" width="120" height="45">
  </div>
  <div class="col search_name ellipsis">
    <h4>Viking: Battle for Asgard</h4>
    <p>
      <img class="platform_img" src="http://cdn2.store.steampowered.com/public/images/v5/platforms/platform_win.png" width="22" height="22">
      Action, Adventure - Released: Oct 17, 2012
    </p>
  </div>
  <div style="clear: both;"></div>
</a>
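
The two most useful bits in that block are buried in URLs: the app id lives in the link's href and the content type lives in the icon filename. Here is a minimal sketch of pulling them out with plain regular expressions, using the attribute values from the sample row. The method names `app_id_from` and `type_from` are made up for this example, and the `ico_type_` prefix is an assumption based on that single sample.

```ruby
# Hypothetical helpers, for illustration only.

# Pull the numeric app id out of a store link, or nil if the link has none.
def app_id_from(href)
  href[/app\/(\d+)/, 1]
end

# Pull the content type out of the type icon's filename, or nil if unrecognized.
# Assumes icons are named ico_type_<type>, as in the sample row.
def type_from(icon_src)
  icon_src[/ico_type_(app|dlc|vid|mod|guide)/, 1]
end

href = "http://store.steampowered.com/app/211160/?snr=1_7_7_230_150_24"
icon = "http://cdn2.store.steampowered.com/public/images/ico/ico_type_app.gif"

puts app_id_from(href)
puts type_from(icon)
```

Both helpers return nil rather than blowing up when the pattern is missing, which makes it easy to skip rows that are not games.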

The highly sophisticated code below is what fetches all the games, and parses out the content from the above block of HTML a few thousand times. Thank you computer. Comments say what each part does because making 8 Gists sounds worse than a hangover.

require 'open-uri'
require 'nokogiri'

class SteamGameStrape
  def initialize
    # Start at 0: next_page increments before fetching, so the first request is page 1.
    @page = 0
    @url = "http://store.steampowered.com/search/results"
    @user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.172 Safari/537.22"
    @row = 0
    @doc = nil
    @ele = nil
  end

  # Increment the requested page number.
  # Save network result to a Nokogiri HTML document.
  # Return formatted results (game id, name, type, etc).
  def next_page
    @page = @page.next
    # URI.open is the open-uri entry point; plain open() no longer accepts URLs on Ruby 3.
    doc_content = URI.open("#{@url}?page=#{@page}", "User-Agent" => @user_agent)
    @doc = Nokogiri::HTML(doc_content)
    parse_page
  end

  # Finds all the blocks that match .search_result_row using a css selector.
  # Set a row and element reference to be used by parsing functions.
  # Return array of parsed results.
  def parse_page
    rows = @doc.css('.search_result_row')
    results = []
    rows.each do |ele|
      @row = @row.next
      @ele = ele
      id = game_id
      if id
        results << format_next_row(id, game_price, game_name, game_type)
      end
    end
    results
  end

  # Basically just joins arguments with tabs for later use.
  def format_next_row(id, price, name, type)
    "#{@row}\t#{id}\t#{type}\t#{price}\t#{name}"
  end

  # The "game_type" is really "content_type" as search results also include guides, mods, etc.
  # The only place that has this information is the icon for the game type,
  # fortunately it's semantically named.
  def game_type
    img_ele = @ele.css('.search_type img').first
    src = img_ele && img_ele['src']
    # A missing icon, missing src, or unknown type all fall through to '???'.
    type = src && src[/app|dlc|vid|mod|guide/]
    type || '???'
  end

  # The game id is Steam's external (maybe internally too?).
  # This is pulled from the permalink to the game page.
  # Returns nil when the link has no app id.
  def game_id
    href = @ele['href']
    href && href[/app\/([0-9]+)/, 1]
  end

  # Game name is scraped from the h4 tag.
  # Steam's employees are totally using XP or something, cuz ISO characters are everywhere.
  def game_name
    raw_name = @ele.css('.search_name h4').first
    # Killing special characters, Encoding::ISO_8859_1 was a pain to convert to UTF-8 versions.
    raw_name ? raw_name.content.gsub(/\s+/, " ").gsub(/[^a-zA-Z0-9 :\-\&\.,]/, '') : ''
  end

  # Parse out the game price... for some reason.
  def game_price
    price_ele = @ele.css('.search_price').first
    # Strip first so a whitespace-only price cell counts as "no price".
    raw_price = price_ele ? price_ele.content.strip.split('$').last : nil
    raw_price || 'Free to Play'
  end
end

#
# Executing script (main)
#
ss = SteamGameStrape.new
# Keep fetching new pages until one comes back with no results.
# (An empty array is truthy in Ruby, so check for emptiness explicitly.)
# This needs _much_ more robust handling.
until (res = ss.next_page).empty?
  res.each do |row|
    puts row
  end
  # Stall execution for five seconds to be nice to Steam servers.
  sleep(5)
end
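
On the "more robust handling" front: as written, one network hiccup kills the whole run. Here is a minimal sketch of one way to harden it, not what the repo does: wrap each fetch in a retry with a growing pause. `with_retries` is a made-up helper, and the attempt count and delays are arbitrary choices.

```ruby
# Hypothetical helper: run the block, retrying on any StandardError.
# Waits base_delay, then 2x, 4x, ... between attempts; re-raises after the last one.
def with_retries(attempts: 3, base_delay: 1)
  tries = 0
  begin
    tries += 1
    yield tries
  rescue StandardError
    raise if tries >= attempts
    sleep(base_delay * 2**(tries - 1))
    retry
  end
end

# Stand-in for a flaky page fetch: fails twice, then succeeds.
result = with_retries(attempts: 3, base_delay: 0) do |try|
  raise IOError, "simulated timeout" if try < 3
  "page fetched on try #{try}"
end
puts result
```

In the script, the fetch would become something like `with_retries { ss.next_page }`.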

If you are only here to rip off a list of games, open this up in Excel. Otherwise, clone it like you know what you’re doing.
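
If spreadsheets are not your thing, the output is just tab-separated text in the order row, id, type, price, name, so it is easy to read back in. Here is a minimal sketch; `parse_row` and the sample line are made up for illustration.

```ruby
# Hypothetical helper: split one line of the scraper's output into a hash.
# Column order follows format_next_row: row, id, type, price, name.
def parse_row(line)
  row, id, type, price, name = line.chomp.split("\t", 5)
  { row: row.to_i, id: id, type: type, price: price, name: name }
end

# Sample line built from the search result shown earlier.
game = parse_row("1\t211160\tapp\t14.99\tViking: Battle for Asgard\n")
puts game[:name]
puts game[:price]
```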

git://github.com/johnnyfuchs/steamredux.git

Branch “blog_part_1” will have the code for this post. Next week I might go down the rabbit hole of putting these in a Neo4j database. I also might be setting up the script to scrape user data. Whichever I’m least likely to not do.
