Scrapebox/Hrefer Ruby Alternative in 5 Minutes

A huge part of a link-building campaign's success, regardless of the hat color and the kind of links being built, is a proper selection of resources. You come up with your platform footprints or whatever other qualifiers, add the stuff you definitely do not want as negative query modifiers, and then go to Google and search. Better yet, you set up a tool to search for you: given the volume of query variations you might want to look at, doing it by hand can be quite a time-consuming job.

Those doing blackhat SEO should be familiar with the existing tools out there that can be used for collecting resource lists based on certain queries and footprints, the most popular being Scrapebox and Hrefer, which is part of XRumer. However, these tools have their limitations.

Scrapebox, besides collecting resources and its most obvious use for comment spamming, has recently come into the limelight as a keyword research tool completely applicable even for whitehats (see slide 18 of this presentation by Mike King – the rest of it is well worth seeing as well, no matter what your hat color is). However, you have to think really hard about all your query variations before running it, and maybe even prepare them in an Excel file by combining all the options (that's what I used to do, anyway).

The Hrefer plugin for XRumer, on the other hand, does at least part of the job for you by combining your footprint queries with a huge list of common English words, so that you break through Google's 1,000-results-per-query limit. However, its queries can only be modified to some extent, and paying a one-time fee of $500 plus recurring monthly fees for software of which you only intend to use the SERP-scraping plugin is not something I'd expect a sane person to do. Paying for the list of modifiers is not worth it either: you can get one for free, and it is only a starting point anyway.

As some of you may know, I am not a professional programmer, but I use Ruby for quickly hacking together a script or two to cover my immediate needs. Building a Hrefer alternative seemed like too easy a task not to attempt, so here is what I came up with. The source code follows, and below it I'll do a bit of explaining for those non-programmer types out there:

require 'rubygems'
require 'cgi'
require 'hpricot'
require 'open-uri'

# Your main list of queries goes here, each one on a new line
queries = '"Powered by SMF" inurl:"register.php"
"Powered by vBulletin" inurl:forum
"Powered by vBulletin" inurl:forums
"Powered by vBulletin" inurl:/forum
"Powered by vBulletin" inurl:/forums
"Powered by vBulletin" inurl:"register.php"'

# Any additional URL-level tweaks for your query can go here
urlmodifiers = 'site:.edu
site:.ac.uk
site:.org
site:.com
-site:.info'

# Any keywords you'd like to add to your query or any negative query modifiers can go here
qmodifiers = 'payday loans
mortgage
-viagra'

queries      = queries.split("\n")
# The modifier strings need splitting too, or the .each calls below will fail
# on Ruby 1.9+ (String#each no longer iterates over lines there)
urlmodifiers = urlmodifiers.split("\n")
qmodifiers   = qmodifiers.split("\n")

queries.each do |query|
  query = CGI.escape(query).gsub(/\s/, "+")

  urlmodifiers.each do |urlmod|
    urlmod = CGI.escape(urlmod).gsub(/\s/, "+")

    qmodifiers.each do |qmod|
      qmod = CGI.escape(qmod).gsub(/\s/, "+")

      # Join the escaped pieces with "+" so the query string stays valid
      url = 'http://www.google.com/search?hl=en&q=' + query + '+' + urlmod + '+' + qmod + '&btnG=Search&num=20'

      # Uncomment the line below if you wish to output the actual query string
      # you are scraping before each query's results are listed
      # puts url
      doc = Hpricot(open(url, "User-Agent" => "Whatever You Make It"))
      links = doc/"//h3[@class='r']/a"
      links.each { |link|
        puts link.attributes['href'].gsub(/\/url\?q=/, "").gsub(/&sa=.+/, "")
      }
    end
  end
end

So what does it do and how do you use it? Well, as you can see, there are three lists that you populate with your query elements (each is a newline-separated string that the script then splits into an array): queries is your list of platform footprints, urlmodifiers is the list of your specific requirements for the TLDs you wish to see in the output, and qmodifiers is the list of any specific keywords you want to target (or, as an option, exclude from your search). You can modify these three lists as you like; you can even use qmodifiers to add a list of the most commonly used English words. For each element in these lists, we need to do certain things to make sure we don't break the query URL for Google: namely, replace all spaces between words with "+" signs (done by gsub(/\s/, "+") – that's a regular expression) and replace quotes and other special symbols with their percent-encoded (URL-safe) equivalents (done by CGI.escape()). Strictly speaking, CGI.escape already turns spaces into "+" signs on its own, so the gsub is just a safety net.
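If you're curious what the escaping step actually produces, here's a quick check you can run in irb (the output is shown as a comment):

require 'cgi'

query = '"Powered by vBulletin" inurl:forum'
puts CGI.escape(query).gsub(/\s/, "+")
# => %22Powered+by+vBulletin%22+inurl%3Aforum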

Next, the script has three loops nested inside each other: we take the first footprint from queries and combine it with the first modifier from urlmodifiers and the first keyword from qmodifiers, then we combine the same footprint and modifier with the second keyword from qmodifiers, and so on until we use up all the keywords in qmodifiers; then we move on to the second modifier in urlmodifiers and run through the same combinations until we exhaust them all… you get the point. As a result, the script runs a Google search for every possible combination of the variants you added to the three lists and outputs the URLs of the results. I've set the result count to 20 with the &num=20 parameter in the URL string, but you can do 100, or even add pagination and parse all 1,000 results for each query if that makes sense for your specific footprints – though for some of them, given all the limitations we pile on, you might only get a couple of results, if any (e.g. I have not seen anything at all found with "Powered by SMF" inurl:"register.php" + site:.edu + payday loans). Before printing each found result, we clean it up a bit by removing the garbage Google adds to the listed URLs in its results page code.
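If you do decide to add pagination, here is a minimal sketch of how the inner loop could walk Google's result pages via the &start parameter (the variable names match the script above; treating an empty page as the end of the results is just my assumption):

(0...1000).step(20) do |offset|
  page_url = url + '&start=' + offset.to_s
  doc = Hpricot(open(page_url, "User-Agent" => "Whatever You Make It"))
  links = doc/"//h3[@class='r']/a"
  break if links.empty? # assume an empty page means no more results
  links.each { |link|
    puts link.attributes['href'].gsub(/\/url\?q=/, "").gsub(/&sa=.+/, "")
  }
end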

Furthermore, there is little limit to how you can modify this script to fit your needs. You can, for example, replace google.com in the URL with yet another variable and set up an array with a bunch of regional versions of Google. You can also add localization parameters to the URL by using something like &meta=cr%3DcountryFR, and if you add them through another array variable you will have a whole list of countries. You can vary your user agent if you're really paranoid about Google flagging you for automated scraping. You can use proxies (these would be passed to the open() method we're using for accessing a specific URL, as another option right after the user agent). You can even, instead of printing out the raw result URLs as you scrape them, collect them all into another array and then filter out the duplicates, either at the actual URL level or at the domain level, much like Scrapebox does. It's all up to your imagination.
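To make a couple of those ideas concrete, here is a rough sketch (not a drop-in patch) of rotating regional Google domains, passing a proxy, and de-duplicating results at the domain level; the engine list and proxy address are placeholders you'd replace with your own:

require 'set'
require 'uri'

engines = %w(google.com google.co.uk google.fr)
seen    = Set.new

engines.each do |engine|
  url = 'http://www.' + engine + '/search?hl=en&q=' + query + '+' + urlmod + '+' + qmod + '&num=20'
  doc = Hpricot(open(url, "User-Agent" => "Whatever You Make It",
                     :proxy => 'http://127.0.0.1:8080')) # placeholder proxy
  (doc/"//h3[@class='r']/a").each { |link|
    result = link.attributes['href'].gsub(/\/url\?q=/, "").gsub(/&sa=.+/, "")
    host = URI.parse(result).host rescue next # skip anything that won't parse
    puts result if seen.add?(host) # Set#add? returns nil for duplicates
  }
end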

To be able to run this script, you will of course need Ruby installed on your computer (Macs come with Ruby pre-installed, BTW), along with the gems (that's what libraries are called in Ruby) referred to in the require statements at the beginning of the script. Of those, only hpricot is an external gem (typically installed with gem install hpricot); cgi and open-uri ship with Ruby itself. You can get Ruby here and you can read more about it here.

Any questions, anyone? Ask away.
