On a semi-related note, I registered the domain “steamredux.com” for this project. Apparently, redux is Latin for “revived from the dead”. That means SteamRedux makes all of zero sense. Aaaaaaand there goes ten bucks that could have bought me a six pack of something 6.7%.
Googling for a list of Steam games got me nowhere. I landed on the Steam search page, which contained a promising little pagination footer indicating that I could just walk through the pages to find every Steam game available. This revelation was “the hard part” of creating a recommendation engine, probably.
My day job is writing JavaScript with jQuery for DOM manipulation, and my GitHub account has a couple of ignored Node.js projects. That technology stack is perfect for screen scraping. Instead, the code here uses Ruby and the Nokogiri gem. I don’t know Ruby at all (or Latin, it seems). Ruby has a bunch of pimp features that I’m not using. Please gloss over the completely synchronous network calls and unclear code; they’re not a reflection of the language, just my recovery Sunday laziness. Even my dog is exhausted from a big night chewing bones. (Something something your mom)
Okay, brass tacks. The Steam search page http://store.steampowered.com/search/ ajaxes in full blocks of HTML from http://store.steampowered.com/search/results with a “page” GET parameter. That response fills an otherwise empty div with blocks that look like this:
<a class="search_result_row even" href="http://store.steampowered.com/app/211160/?snr=1_7_7_230_150_24">
  <div class="col search_price"> $14.99 </div>
  <div class="col search_type">
    <img src="http://cdn2.store.steampowered.com/public/images/ico/ico_type_app.gif">
  </div>
  <div class="col search_metascore"></div>
  <div class="col search_released"> Oct 17, 2012 </div>
  <div class="col search_capsule">
    <img src="http://cdn2.steampowered.com/v/gfx/apps/211160/capsule_sm_120.jpg?t=1350669009" alt="Buy Viking: Battle for Asgard" width="120" height="45">
  </div>
  <div class="col search_name ellipsis">
    <h4>Viking: Battle for Asgard</h4>
    <p>
      <img class="platform_img" src="http://cdn2.store.steampowered.com/public/images/v5/platforms/platform_win.png" width="22" height="22">
      Action, Adventure - Released: Oct 17, 2012
    </p>
  </div>
  <div style="clear: both;"></div>
</a>
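Only two fields in that block need more than reading the text out of a tag: the app id lives in the link's href, and the content type only shows up in the icon filename. Here is a minimal sketch of pulling both out with plain Ruby regexes, using strings copied from the sample above. The constant and helper names are made up for illustration and are not part of the scraper.

```ruby
# Sample values copied from the search result block above.
HREF = "http://store.steampowered.com/app/211160/?snr=1_7_7_230_150_24"
ICON = "http://cdn2.store.steampowered.com/public/images/ico/ico_type_app.gif"

# Hypothetical helper: returns the numeric app id as a string, or nil
# when the link does not point at an /app/ page.
def app_id_from(href)
  href[/app\/([0-9]+)/, 1]
end

# Hypothetical helper: returns the content type taken from the icon
# filename (ico_type_<type>.gif), or nil when it is not a known type.
def type_from(icon_src)
  icon_src[/ico_type_(app|dlc|vid|mod|guide)/, 1]
end

puts app_id_from(HREF)
puts type_from(ICON)
```

String#[] with a regex and a capture index returns nil instead of raising when nothing matches, which keeps the "no id, skip the row" case simple.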
The highly sophisticated code below is what fetches all the games, and parses out the content from the above block of HTML a few thousand times. Thank you computer. Comments say what each part does because making 8 Gists sounds worse than a hangover.
require 'open-uri'
require 'nokogiri'
require 'cgi'

class SteamGameStrape
  def initialize
    # Start at 0 so the first call to next_page fetches page 1.
    @page = 0
    @url = "http://store.steampowered.com/search/results"
    @user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.172 Safari/537.22"
    @row = 0
    @doc = nil
    @ele = nil
  end

  # Increment the requested page number.
  # Save network result to a Nokogiri HTML document.
  # Return formatted results (game id, name, type, etc),
  # or nil once a page comes back with no result rows at all.
  def next_page
    @page = @page.next
    # URI.open needs Ruby 2.5 or newer; older Rubies use plain open() here.
    doc_content = URI.open("#{@url}?page=#{@page}", "User-Agent" => @user_agent)
    @doc = Nokogiri::HTML( doc_content )
    # An empty array is truthy in Ruby, so return nil to let the caller's loop stop.
    return nil if @doc.css('.search_result_row').empty?
    self.parse_page
  end

  # Finds all the blocks that match .search_result_row using a css selector.
  # Set a row and element reference to be used by parsing functions.
  # Return array of parsed results.
  def parse_page
    rows = @doc.css('.search_result_row')
    results = []
    rows.each do |ele|
      @row = @row.next
      @ele = ele
      id = self.game_id
      if id
        results << self.format_next_row( id, self.game_price, self.game_name, self.game_type )
      end
    end
    results
  end

  # Basically just joins arguments with tabs for later use.
  def format_next_row( id, price, name, type)
    "#{@row}\t#{id}\t#{type}\t#{price}\t#{name}"
  end

  # The "game_type" is really "content_type" as search results also include guides, mods, etc.
  # The only place that has this information is the icon for the game type,
  # fortunately it's semantically named.
  def game_type
    img_ele = @ele.css('.search_type img').first
    src = img_ele && img_ele['src']
    # String#[] with a regex returns the matched text, or nil when nothing matches.
    type = src && src[/app|dlc|vid|mod|guide/]
    type || '???'
  end

  # The game id is Steam's external id (maybe internal too?).
  # This is pulled from the permalink to the game page.
  # Returns nil for rows that do not link to an /app/ page.
  def game_id
    href = @ele['href']
    href && href[/app\/([0-9]+)/, 1]
  end

  # Game name is scraped from the h4 tag.
  # Steam's employees are totally using XP or something, cuz ISO characters are everywhere.
  def game_name
    raw_name = @ele.css('.search_name h4').first
    # Killing special characters, Encoding::ISO_8859_1 was a pain to convert to UTF-8 versions.
    raw_name ? raw_name.content.gsub(/[\s]+/, " ").gsub(/[^a-zA-Z0-9 :\-\&\.,]/, '') : ''
  end

  # Parse out the game price... for some reason.
  def game_price
    price_ele = @ele.css('.search_price').first
    # Strip whitespace so a blank price cell counts as empty.
    raw_price = price_ele ? price_ele.content.split('$').last.to_s.strip : ''
    raw_price.empty? ? 'Free to Play' : raw_price
  end
end

#
# Executing script (main)
#
ss = SteamGameStrape.new
# As long as results come back, keep fetching new pages.
# This needs _much_ more robust handling.
while res = ss.next_page do
  res.each do |row|
    puts row
  end
  # Stall execution for five seconds to be nice to Steam servers.
  sleep( 5 )
end
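The main loop's own comment admits it needs much more robust handling. Here is one way that could look, sketched without any network code: stop on the first empty page, cap the number of pages, and retry a failed fetch a few times before giving up. The method `each_page`, its keyword arguments, and the fake fetcher are all hypothetical and are not part of the scraper above.

```ruby
# Walk pages until one comes back empty, retrying failed fetches a few times.
# `fetch_page` is any callable that takes a page number and returns an array
# of rows (a hypothetical interface, chosen so this runs without the network).
def each_page(fetch_page, max_pages: 1000, retries: 3, delay: 5)
  (1..max_pages).each do |page|
    attempts = 0
    begin
      rows = fetch_page.call(page)
    rescue StandardError
      attempts += 1
      raise if attempts > retries   # give up and re-raise the last error
      sleep(delay)
      retry                         # run the begin block again for this page
    end
    break if rows.nil? || rows.empty?
    yield page, rows
    sleep(delay)                    # be nice to the server between pages
  end
end

# Fake fetcher standing in for the real HTTP call: three pages, then nothing.
fake_pages = { 1 => ["a", "b"], 2 => ["c"], 3 => ["d"] }
fetcher = ->(page) { fake_pages.fetch(page, []) }

seen = []
each_page(fetcher, delay: 0) { |_page, rows| seen.concat(rows) }
puts seen.inspect
```

Passing the fetcher in as a callable keeps the stop and retry logic testable on its own. The real version would hand it something that wraps the scraper's page request.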
If you are only here to rip off a list of games, open this up in Excel. Otherwise, clone it like you know what you’re doing.
git://github.com/johnnyfuchs/steamredux.git
Branch “blog_part_1” will have the code for this post. Next week I might go down the rabbit hole of putting these in a Neo4j database. I also might be setting up the script to scrape user data. Whichever I’m least likely to not do.
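For anyone taking the Excel route: the scraper prints tab-separated lines, which Excel will usually open as-is. If it balks, the rows can be turned into comma-separated text first. This is a rough sketch, assuming the column order from format_next_row (row, id, type, price, name). The helper names and the sample line are made up for illustration.

```ruby
# Column order matches format_next_row in the scraper: row, id, type, price, name.
COLUMNS = %w[row id type price name]

# Quote a field for CSV only when it contains a comma, quote, or newline.
def csv_field(value)
  value =~ /[",\n]/ ? "\"#{value.gsub('"', '""')}\"" : value
end

# Convert one tab-separated scraper line into one CSV line.
def tsv_to_csv(line)
  # The -1 limit keeps trailing empty fields instead of dropping them.
  line.chomp.split("\t", -1).map { |field| csv_field(field) }.join(",")
end

# Made-up sample line in the scraper's output format.
sample = "1\t211160\tapp\t14.99\tViking: Battle for Asgard"
puts COLUMNS.join(",")
puts tsv_to_csv(sample)
```

The quoting matters because game names are allowed to keep their commas, and an unquoted comma would split one name across two columns.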