Scraping Search Engines

“Web scraping” means extracting the contents of a web page. For regular web pages, you probably want the text (for analysis, perhaps). In the case of search engines, you presumably want the links on the results pages.

The problem with scraping search engines, though, is that the results pages also contain links you don’t want, such as the navigation link back to the homepage.

One solution to this is to scrape twice.

Scrape the homepage and create a list, array, or whatever, of all those links. Obviously, homepage links are not results links.
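A minimal sketch of that first pass, using only Python’s standard-library HTML parser. The homepage HTML here is a hard-coded, hypothetical stand-in for whatever you actually fetch:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects every href value from <a> tags it encounters."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)

# Hypothetical homepage HTML; in practice you'd fetch this first.
homepage_html = '<a href="/about">About</a><a href="https://example.com/privacy">Privacy</a>'

collector = LinkCollector()
collector.feed(homepage_html)
homepage_links = collector.links
print(homepage_links)  # ['/about', 'https://example.com/privacy']
```

Run the same collector over the results-page HTML later and you have both lists to compare.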

Once you have those links, you can then scrape the results page. Create a new list, array, or whatever, with only the links that are NOT on the homepage. Using your programming language of choice, simply check that each link is not in the first list (pardon the Python) before appending it to the second list.
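That second pass might look like this sketch, where both lists are hypothetical stand-ins for what the two scrapes returned:

```python
# Hypothetical links gathered from the first (homepage) scrape.
homepage_links = ["/about", "https://example.com/privacy"]

# Hypothetical links scraped from the results page.
results_page_links = [
    "/about",
    "https://example.com/privacy",
    "https://first-result.example/page",
    "https://second-result.example/page",
]

# Keep only the links that did NOT appear on the homepage.
results_only = []
for link in results_page_links:
    if link not in homepage_links:
        results_only.append(link)

print(results_only)
```

If the homepage has many links, turning `homepage_links` into a `set` first makes each `not in` check constant-time instead of a linear scan.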

In addition to this, make sure that the links start with “http.” You definitely don’t want relative hyperlinks, since those point back into the same site and so can never be external; absolute hyperlinks, hopefully, are.
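That check is just a string-prefix test, sketched here with hypothetical links:

```python
links = ["/search?q=example", "#top", "https://a-result.example", "http://another.example"]

# Relative links ("/search...", "#top") can't be external, so keep only
# links beginning with "http" (which covers both http and https).
absolute_links = [link for link in links if link.startswith("http")]

print(absolute_links)  # ['https://a-result.example', 'http://another.example']
```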

The end result of all this is, hopefully, that you scrape only the actual search results. I haven’t tested every search engine, but it works with Google.
