
Mobile SEO Dynamic Serving Tests

In December 2011 Google announced they were adding smartphone crawling to their mobile crawler Googlebot-Mobile, which at the time handled feature phone crawling. The key sentence from that announcement was:

The content crawled by smartphone Googlebot-Mobile will be used primarily to improve the user experience on mobile search. For example, the new crawler may discover content specifically optimized to be browsed on smartphones as well as smartphone-specific redirects.

This new functionality from Google produced some very interesting behaviour in the mobile search results. In March 2012, I provided the research that helped Cindy Krum put together a piece for Search Engine Land on the impacts of the new smartphone crawler from Google.

Fast forward to January 2014 and Google announced another substantial change to how they were going to handle smartphone crawling moving forward. To simplify configuration for webmasters and in response to the prolific growth in smartphone usage, Google moved smartphone crawling from Googlebot-Mobile into Googlebot.

Google subsequently published new recommendations and guidelines for making a website mobile friendly. After reading through this documentation, quite a few questions remained unanswered and it wasn’t clear whether anything about Google’s position from 2011 had changed now that smartphone crawling had a new home within Googlebot.

Current State Of Play

Since June 2012, responsive web design has been Google’s recommended approach; however, they also support mobile specific websites and dynamic serving. Websites that use responsive web design are helpful for Google, as they can crawl the site once with Googlebot and get all the information needed. It gets more complicated for website owners and Google when mobile specific websites or dynamic serving are involved, and this is where Googlebot (smartphone) plays a role in helping Google understand what user experience a website delivers.

The role of Googlebot-Mobile when it crawled with a smartphone user agent, and now of Googlebot (smartphone), is quite well understood for mobile specific websites. The smartphone crawler from Google will detect user agent based redirects, faulty redirects, mobile app download interstitials and a variety of other elements. Google uses this information to optimise the search experience for users by linking directly to mobile content and avoiding redirects where possible, correctly returning mobile optimised URLs thanks to rel=”alternate” tags and so forth.

What isn’t that well understood is how Google handles dynamic serving and what role Googlebot (smartphone) might play in that now that crawling responsibilities have been moved from Googlebot-Mobile over to Googlebot.

To gain some additional clarity on the impact of dynamic serving, not just for mobile SEO but for search engine optimisation in general, I put together a series of tests. The tests weren’t meant to be exhaustive but aimed to cover enough functionality to better understand the impacts and risks of using dynamic serving and what role Googlebot and Googlebot (smartphone) may play in it.

Algorithms Determine If Smartphone Crawling Is Needed

Initially the tests were deployed onto Convergent Media, which runs WordPress and uses a responsive web design template. It only took a couple of days for Googlebot to discover and begin crawling through the test setup. After waiting a week, still no Googlebot (smartphone) – which I thought was odd at the time. Waiting, more waiting and more waiting, still no Googlebot (smartphone) crawling of any of the test URLs.

I reached out to John Mueller to ask about the situation I was seeing unfold and he said:

We don’t crawl everything as smartphone, but when we recognize it makes sense, we’ll do that. For responsive design, the good thing is that we don’t need to crawl it with a smartphone — one crawl is enough to get all versions.

Now it made sense why Googlebot (smartphone) wasn’t visiting the test setup: Convergent Media uses a responsive web design and Google’s algorithms had decided smartphone crawling wasn’t needed.

It is worth noting that despite Googlebot discovering the test URLs, and those URLs sending signals such as HTTP Vary response headers, it wasn’t a strong enough signal to trigger Googlebot (smartphone) to visit the site. If more pages on the site had not been using responsive web design, that might have prompted Googlebot (smartphone) to visit, but it didn’t happen as part of this test setup.

Google have a vast amount of computing resources, but double crawling every URL on the web is obviously out of the question. The comment from John, “when we recognize it makes sense”, got me thinking about the traits Google might be looking for in a website that could trigger Googlebot (smartphone) to begin crawling it, such as:

  • discover common ‘mobile website’ style links on the desktop website
  • discover links to mobile app stores, suggesting website owner is switched on/aware of mobile specific user experiences
  • discover rel=”alternate” mobile tags on desktop
  • discover HTTP Vary response header on desktop
  • m/mobile subdomain verified in Google Webmaster Tools
  • discover m/mobile subdomain XML sitemap referenced via desktop robots.txt file as a cross domain submission
  • discover m/mobile subdomain
  • crawl m/mobile subdomain taking note of key HTML elements like meta viewport
  • crawl m/mobile subdomain taking note of common HTML/CSS/JavaScript frameworks in use, such as jQuery Mobile

If you’re wondering why your site isn’t getting the attention from Googlebot (smartphone) that you think it deserves, it’d be worth considering some of the above points, and others, with respect to sending Google the right kind of signals that you’re in the mobile space.

Googlebot (smartphone) Mobile SEO Tests

Not wanting to be dissuaded from completing the test, I reached out to Dan Petrovic of Dejan SEO. While the Dejan SEO website runs on WordPress, it doesn’t currently have a responsive web design implemented. I asked Dan if he could check for Googlebot (smartphone) activity and if he’d be willing to host my test files; the answer was yes on both counts!

Eight tests were implemented with a goal to determine:

  1. if there are crawling differences with/without the HTTP Vary response header
  2. if URLs served only to Googlebot (smartphone) are used for discovery
  3. if URLs crawled only by Googlebot (smartphone) are indexed
  4. if meta robots noindex tags served to Googlebot (smartphone) are actioned
  5. if rel=”canonical” tags served to Googlebot (smartphone) are actioned
  6. if HTTP X-Robots-Tag noindex headers served to Googlebot (smartphone) are actioned
  7. if HTTP Link rel=”canonical” response headers served to Googlebot (smartphone) are actioned
  8. if anchor text seen by only Googlebot (smartphone) has an impact on rankings

Test 1

Aim: Determine if there are crawling differences with/without the HTTP Vary response header.

Implementation:

Four files were created: files 1 & 2 serve the same content to both Googlebot and Googlebot (smartphone), with the latter adding a Vary header. Files 3 & 4 serve different content based on the user agent, again with the latter adding a Vary header.
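
To make that setup concrete, below is a rough sketch of what one of the dynamically served test files could look like, written as a small Flask handler. It is an illustration only, not the actual test code: the route, the user agent detection and the content are placeholder assumptions.

from flask import Flask, request, make_response

app = Flask(__name__)

# Crude detection purely for illustration; real implementations usually
# rely on a maintained device detection library.
MOBILE_TOKENS = ("iphone", "android", "mobile")

@app.route("/dynamic-serving-test/")
def dynamic_serving_test():
    user_agent = request.headers.get("User-Agent", "").lower()
    if any(token in user_agent for token in MOBILE_TOKENS):
        body = "<p>Mobile content, seen by Googlebot (smartphone).</p>"
    else:
        body = "<p>Desktop content, seen by Googlebot.</p>"
    response = make_response(body)
    # Files 2 and 4 in the test added this header to signal that the
    # response changes based on the requesting user agent.
    response.headers["Vary"] = "User-Agent"
    return response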

Results:

No measurable crawling differences in the URLs. This could simply be that the sample size is very small or that adding a HTTP Vary response header on its own isn’t a sufficiently strong signal to influence crawl rate of Googlebot (smartphone).

Test 2

Aim: Determine if URLs served only to Googlebot (smartphone) are used for discovery.

Implementation:

Two test URLs were set up using dynamic serving, the latter of the two also using the HTTP Vary header. The mobile versions seen by Googlebot (smartphone) both link to unique URLs not seen by Googlebot. The unique URLs are available for both Googlebot and Googlebot (smartphone) to crawl.

Results:

Only the two desktop URLs were indexed. The dynamically served content on those URLs can’t be queried for in Google successfully. The unique URLs only seen by Googlebot (smartphone) have been crawled by Googlebot (smartphone) only but have not been indexed.

Test 3

Aim: Determine if URLs crawled only by Googlebot (smartphone) are indexed.

Implementation:

Both tests use dynamic serving, the latter of the two also uses the HTTP Vary header. The mobile versions of both tests link to unique URLs not seen by Googlebot. The unique URLs are crawlable by Googlebot (smartphone), however Googlebot will receive a HTTP 403 (Forbidden) response code.

Results:

Only the two desktop URLs were indexed. Googlebot (smartphone) accessed the unique URL associated with file 1, however, as in Test 2, it wasn’t indexed. Unfortunately Googlebot did not access either of the unique mobile URLs to receive a HTTP 403 response code, however I suspect the outcome wouldn’t have changed.

Test 4

Aim: Determine if meta robots noindex tags served to Googlebot (smartphone) are actioned.

Implementation:

Both tests use dynamic serving, the latter also uses the HTTP Vary header. The mobile versions of both tests include a meta noindex tag that isn’t in the desktop HTML counterparts.

Results:

Both the desktop URLs are indexed. It appears the meta noindex tag served to Googlebot (smartphone) was ignored.

Test 5

Aim: Determine if rel=”canonical” tags served to Googlebot (smartphone) are actioned.

Implementation:

Tests 1 & 2 have no rel=”canonical” tag on the desktop content but do on the mobile version, with the latter also including a Vary header. Tests 3 & 4 have different rel=”canonical” tags on the desktop and mobile versions, the latter also including a Vary header. Tests 5 & 6 return HTTP 403 (Forbidden) to Googlebot, while Googlebot (smartphone) can crawl the URLs, which both include a rel=”canonical” tag, the latter also with a Vary header.

Results:

The desktop URLs of test 1 and 2 were indexed, ignoring the rel=”canonical” served to Googlebot (smartphone). The desktop URLs of test 3 and 4 were indexed, ignoring the rel=”canonical” tag served to Googlebot (smartphone). Test URLs 5 and 6 which served HTTP 403 response codes to Googlebot were not indexed and Googlebot (smartphone) didn’t crawl those URLs.

Test 6

Aim: Determine if HTTP X-Robots-Tag noindex headers served to Googlebot (smartphone) are actioned.

Implementation:

Tests 1 & 2 use dynamic serving and have HTTP X-Robots-Tag noindex response headers added to the mobile version with test 2 also including a HTTP Vary response header.

Results:

Both desktop URLs were indexed; the HTTP X-Robots-Tag noindex response header was ignored when served to Googlebot (smartphone).

Test 7

Aim: Determine if HTTP Link rel=”canonical” response headers served to Googlebot (smartphone) are actioned.

Implementation:

Tests 1 & 2 use dynamic serving and include a HTTP Link rel=”canonical” response header in the mobile version of the page, the latter also includes HTTP Vary response headers.

Results:

The desktop URLs of both tests were indexed, ignoring the HTTP Link rel=”canonical” response headers served to the mobile versions. As in Test 5 above, which tested rel=”canonical” meta tags, the presence of the HTTP Link rel=”canonical” response header did trigger Googlebot (smartphone) to visit the referenced canonical URL.

Test 8

Aim: Determine if anchor text seen by only Googlebot (smartphone) has an impact on rankings.

Implementation:

Tests 1 & 2 use dynamic serving, the latter also has HTTP Vary headers. The mobile versions of each test link to a URL not seen by Googlebot with anchor text unrelated to the content on the linked URL. The unique URL linked to by the mobile versions of the content is accessible to Googlebot (smartphone), however Googlebot will receive a HTTP 403 (Forbidden) error trying to crawl the unique URLs.

Results:

The desktop URLs for each test were indexed, content served into the mobile specific versions couldn’t be queried. Like earlier tests, the HTTP 403 response code served to Googlebot for the unique mobile URLs meant those URLs weren’t indexed.

Conclusion

With the tests completed, it is helpful to have a much better understanding of the role of Googlebot (smartphone) and its capabilities now that smartphone crawling has been moved over to Googlebot.

The key takeaway points from the above:

  • Simply adding a HTTP Vary response header on its own didn’t appear to have an impact on crawl rate or the outcomes of any subsequent tests. However, it should be added when user agent detection is being used as it is a strong recommendation from Google and it represents best practice to help intermediate HTTP caches on the internet.
  • Googlebot (smartphone) is being used for URL discovery, via in-content links, rel=”canonical” meta tags and HTTP Link rel=”canonical” response headers.
  • Googlebot (smartphone) appears to ignore meta robots noindex tags, HTTP X-Robots-Tag noindex response headers, rel=”canonical” meta tags and HTTP Link rel=”canonical” directives. While an exhaustive list of all possible options was not tested, it is reasonable to assume that if common directives like meta noindex are ignored (something that Google will always honour via Googlebot), all other meta style directives will also be ignored.
  • Googlebot (smartphone) does not appear to index unique content that is served via dynamic serving. It does not appear to be possible to query Google and return that unique content via a desktop browser or mobile device.
  • Googlebot (smartphone) appears to be used primarily for understanding web site user experience (ie, does website X provide a mobile experience) and optimising the user experience where possible (ie, skipping redirects leading to m.domain.com, returning the correct URLs in search results via rel=”alternate” meta tags).
  • Despite the amazing growth of mobile, it appears that crawling, indexing and ranking is largely based upon data processed by Googlebot.

 

Budweiser – Rapid Fire SEO Audit


SEMrush recently published an article titled Why Budweiser Gets An “F” In SEO.

A brand as massive as Budweiser should have no problems ranking in search engines for all manner of relevant terms, however based on the comments from Ryan Johnson – clearly that isn’t the case.

Following are some of the items identified after completing a rapid fire SEO website audit of budweiser.com to see what other sorts of issues might be causing them problems:

Domains

Budweiser have a lot of sub-domains configured, which in and of itself isn’t a problem. However, it does become a problem when settings aren’t configured properly. In a few seconds the following list of sub-domains showed up:

  • www
  • m
  • mobile
  • riseasone
  • p12
  • madeinamerica
  • new
  • qa
  • qa.m
  • qa.new
  • qa.qr
  • qa.riseasone
  • origin
  • origin-www
  • qr

Many of the sub-domains are development versions of the site, such as the qa.* sub-domains or new. In an ideal world only the primary website would be crawled and indexed by search engines. Having so many copies of budweiser.com indexed poses a duplicate content issue for the site and could lead to problems down the road.

robots.txt

robots.txt files are used to control what content spiders are allowed to crawl, but they don’t control indexing (a common misconception). The robots.txt file used across many of the sub-domains listed above is incorrectly configured.

A few issues that appeared at a glance:

  • multiple blocks for the same user-agent
  • attempting to disallow a domain instead of a URL on the current domain
  • incorrect usage of the * and specific spider user agent blocks

In the first issue above, since the directives are split across multiple blocks, a spider could pick either block of directives without combining them, leaving half of the URLs that were intended to be blocked available for crawling.

The second issue is a massive problem, as Budweiser have disallow directives that attempt to block a domain from being crawled which isn’t a supported feature of the Robots Exclusion Protocol. As such, those domains which they had intended to block from crawling are available for crawling except for the URLs correctly specified within the robots.txt file.

Spiders that honour robots.txt will pick the most specific user-agent block for their crawler, falling back onto less specific blocks, then onto the wild card and, if nothing is present, they’ll assume the website is fully available for crawling.

The Budweiser robots.txt file has a block for Googlebot-Image, however it has no disallow directive. That may be considered invalid, causing the block to be ignored entirely and Googlebot-Image to fall back to a less specific block if one exists. In this particular instance, since the less specific blocks allow images to be crawled, it is unlikely to be causing a problem, but it should be corrected as a matter of hygiene.

Following on from the above, there are two Google-specific blocks defined within the robots.txt:

  • Googlebot-Image
  • Google

The support documentation by Google on their list of crawlers doesn’t mention a crawler named ‘Google’. All of the spiders that do support falling back to a less specific Google spider fall back to ‘Googlebot’. This behaviour documented by Google means that all directives specified in the block for the user agent ‘Google’ will be ignored and those crawlers will fall back to the wild card * entry.
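
If you want to sanity check how a given robots.txt might be interpreted, a rough sketch using Python’s built-in parser is below. Treat it as indicative only: urllib.robotparser is not Googlebot and its user agent matching rules differ from Google’s documented behaviour, but it is a quick way to spot blocks that don’t behave the way you intended. The robots.txt content here is invented for illustration.

import urllib.robotparser

# Invented robots.txt mirroring the kind of layout discussed above:
# a wild card block and a more specific Googlebot block with fewer rules.
robots_txt = """
User-agent: *
Disallow: /private/
Disallow: /campaigns/

User-agent: Googlebot
Disallow: /private/
""".splitlines()

parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_txt)

# A crawler that matches a specific block uses only that block's rules,
# so here Googlebot is free to crawl /campaigns/ even though the wild
# card block disallows it.
for agent in ("Googlebot", "SomeOtherBot"):
    for path in ("/private/", "/campaigns/"):
        print(agent, path, parser.can_fetch(agent, path))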

Next on the agenda are disallow directives in the wild card block that are not present in the name specific blocks. Again, while not a problem in and of itself, after reviewing the content of the blocks it is clear that Budweiser’s intention and what is actually happening aren’t in sync.

XML Sitemap

Good news, Budweiser are generating an XML sitemap.

Even more good news, it is linked from robots.txt for easy discovery by all relevant bots.

Bad news, crawling www.budweiser.com returned 81 web pages however only 20 of those pages are listed within the XML sitemap to help search engines discover, crawl and index the Budweiser content.

More bad news, the XML sitemap links to broken URLs.

Internal Redirects

Crawling www.budweiser.com with a tool identified over 6,500 internal redirects within the site.

Each time Google processes a redirect, a small amount of the equity that Google would have passed to the linked URL, in the 10-20% range, is needlessly lost.

While Budweiser are correctly using HTTP 301 permanent redirects, they should simply update their internal links to point directly to the intended URL. In time the site will recover the lost equity and it has the added benefit of speeding the site up slightly for users as well.

URL Canonicalisation

URL canonicalisation is a process where one true URL is defined for a given resource.

To provide an example, search engines might find thousands of links to the home page of www.budweiser.com with marketing campaign tracking codes, which they consider completely separate URLs by default. Correctly configuring the rel=”canonical” meta tag or HTTP response header provides a mechanism to instruct search engines to merge all of the equity split over thousands of URLs into the true home page URL, boosting its strength and capacity to rank in the search results.

Budweiser are canonicalising the content throughout their site inconsistently: some URLs include a rel=”canonical” meta tag while others don’t. Crawling www.budweiser.com yielded 81 web pages, however only 37 of them appeared to have the rel=”canonical” meta tag specified.

Additionally, there were examples where Budweiser have a rel=”canonical” tag specified with the wrong URL. For example the brewery locations page, www.budweiser.com/our-brand/brewery-locations.html has a canonical value of http://www.budweiser.com/our-brand/brew-location.html which produces a 404 error.

Since approximately 50% of the site doesn’t have a rel=”canonical” tag specified, and in some instances it is incorrectly configured, it’s possible there is a lot of equity or PageRank being squandered through poor configuration.
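
Checking this sort of thing by hand doesn’t scale well, so below is a rough sketch of how the canonical targets could be audited automatically. It is a simplified, assumption-laden example: the page list is a placeholder, the regex is naive (it assumes rel appears before href in the link tag) and a real crawl would need politeness controls and proper HTML parsing.

import re
import urllib.error
import urllib.request

# Placeholder list of pages to audit; in practice this would come from a crawl.
PAGES = [
    "http://www.budweiser.com/our-brand/brewery-locations.html",
]

CANONICAL_RE = re.compile(
    r'<link[^>]+rel=["\']canonical["\'][^>]+href=["\']([^"\']+)["\']',
    re.IGNORECASE)

def fetch(url):
    request = urllib.request.Request(url, headers={"User-Agent": "canonical-audit"})
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status, response.read().decode("utf-8", errors="replace")

for page in PAGES:
    try:
        status, html = fetch(page)
    except urllib.error.HTTPError as error:
        print(page, "-> returned", error.code)
        continue
    match = CANONICAL_RE.search(html)
    if not match:
        print(page, "-> no rel=canonical tag found")
        continue
    canonical = match.group(1)
    try:
        canonical_status, _ = fetch(canonical)
    except urllib.error.HTTPError as error:
        canonical_status = error.code
    print(page, "-> canonical", canonical, "returns", canonical_status)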

rel=”nofollow”

The rel=”nofollow” meta tag or attribute instructs Google to drop any links affected by the nofollow directive from their link graph. This is commonly used for links to third party websites that might be untrusted (ie, submitted via user generated content) or for advertising.

By removing the affected links from the link graph, it has a knock-on effect that those links inherently cannot play a role in boosting the search engine rankings of the linked URL, since the link is removed from the graph and no equity or PageRank can flow through that link.

Simplistically, when Google calculates how much PageRank or equity flows through each link on a URL, they take the equity of the linking URL and divide it equally among the number of outbound links.

In years gone by, applying a rel=”nofollow” to an internal link meant that the equity that Google had originally allocated to that link would be reallocated to all other equity flowing out links, increasing the amount of equity flowing through those links. This technique of maximising the equity flowing through specific links within a site became known as PageRank sculpting.

Several years ago Google changed the behaviour of how internal rel=”nofollow” links are handled: instead of reallocating the equity of the affected links to all other equity flowing links, that equity now vanishes or evaporates.
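
A tiny worked example makes the difference clearer. The numbers below are invented and the model is the simplistic equal-division one described above, not Google’s actual calculation.

# Invented figures: a page with 10 units of equity and 5 outbound links,
# 2 of which carry rel="nofollow".
page_equity = 10.0
outbound_links = 5
nofollowed_links = 2

# Old behaviour (the PageRank sculpting era): the share originally
# allocated to nofollowed links was reallocated to the followed links.
old_share_per_followed_link = page_equity / (outbound_links - nofollowed_links)  # ~3.33

# Current behaviour: every link is still counted in the division, but the
# share assigned to nofollowed links evaporates instead of being reallocated.
new_share_per_followed_link = page_equity / outbound_links  # 2.0
evaporated_equity = new_share_per_followed_link * nofollowed_links  # 4.0

print(old_share_per_followed_link, new_share_per_followed_link, evaporated_equity)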

On quick inspection, Budweiser have internal rel=”nofollow” links pointing to the following URLs (maybe more):

  • /content/budweiser/en/loginpage.html
  • /content/budweiser/en/registerpage.html

<title> tags

The <title> tag is a strong indicator to Google about the content they should expect to find in a given page and is displayed prominently in the search results for users to evaluate whether or not a given URL would yield the content they are looking for.

Broadly speaking the <title> tags in use throughout the Budweiser website are okay. For example, reviewing the <title> tags used throughout the Budweiser Clydesdales blog shows that they lead with a descriptive title of the page, they aren’t bloated, nor are they keyword stuffed.

However, there are a number of high priority pages that could be improved, such as:

  • /shop.html (current title: Shop)
  • /our-beers/product-locator.html (current title: Product Locator)

<hX> tags

The rules of headings are pretty basic, no rocket science needed:

  • use one <h1> tag per page that describes the primary content of the page
  • if you need more headings, use <h2> through <h6>
  • nest heading tags as needed to give hierarchy to the document
  • use descriptive headings to help users and search engines alike understand it better

These simple rules are being broken throughout the Budweiser website:

  • first <h1> is actually wrapped around the logo
  • there are multiple <h1> tags in each page
  • there is no nesting of <hX> tags to create hierarchy if and when needed
  • <h1> tags for the primary body copy area often aren’t descriptive or relevant to the content

Fortunately, this is a fast and simple problem to correct throughout the Budweiser website.

Content

Completing in depth* keyword research using Ubersuggest highlights a variety of topics that users want information on related to the Budweiser brand which is great news.

Unfortunately the Budweiser website suffers from an all too common condition of being brochureware, in that it looks good but has no real substance or content to help search engines.

Take for instance the Budweiser product Chelada. As a consumer, it’d be a reasonable expectation to head to Google, type in ‘chelada’ and find Budweiser in the top 5-10 positions, but no. No problem, the consumer refines their query to ‘chelada beer’, still nothing. More refinement, ‘chelada beer budweiser’, and even with the Budweiser keyword, budweiser.com still doesn’t have a position 1 ranking.

However, when reviewing the Chelada web page it is clear Budweiser are giving Google nothing to work with – the only way they could reasonably have given Google less would have been to delete the page entirely from their website.

As an immediate step, Budweiser should perform keyword research for all of their products and build out the relevant content consumers are seeking. They could expect to see their website bounce to the top of the search results if they do a good job of this.

* submit it once with the keyword ‘budweiser’ and scan the results

Structured Markup

Structured markup such as schema.org allows a publisher to provide rich meta data about the content on the page, which search engines like Google use to augment the search results. Common use cases that are very visible are elements like reviews that can produce star ratings in the search results.

Quickly clicking through budweiser.com, it appears they have a few opportunities for this:

  • brewery locations
  • beer nutritional information

With respect to the first point, the brewery page indicates that they have 12 brewery locations across the United States, however no additional information is provided about those locations. It’d be practical and helpful to users to provide their address, phone number, opening hours, whether they offer tours, sell products and so forth. Some of this information could be marked up using the LocalBusiness schema.org element.

The extensive keyword research performed clearly indicated that consumers are interested in the nutritional information of the Budweiser products. If Budweiser were to provide detailed nutritional information, it’d help Google, help users and also allow them to mark up that information using the NutritionInformation schema.org object – which may lead to interesting universal objects appearing in the search results.
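
As a rough illustration of the sort of markup that could back this up, the sketch below generates LocalBusiness structured data for one hypothetical brewery location as JSON-LD. The details are invented placeholders, and the same properties could equally be expressed as microdata within the existing page templates.

import json

# Hypothetical details for a single brewery location; the real values
# would come from Budweiser's own location data.
brewery_location = {
    "@context": "https://schema.org",
    "@type": "LocalBusiness",
    "name": "Example Budweiser Brewery Tour Centre",
    "telephone": "+1-555-0100",
    "address": {
        "@type": "PostalAddress",
        "streetAddress": "1 Example Street",
        "addressLocality": "Example City",
        "addressRegion": "XX",
        "postalCode": "00000",
        "addressCountry": "US",
    },
    "openingHours": "Mo-Su 10:00-17:00",
}

# The output would be embedded in the page inside a
# <script type="application/ld+json"> element.
print(json.dumps(brewery_location, indent=2))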

Load Time Performance

Users don’t like slow websites.

Assessing the Budweiser website with a variety of performance testing tools such as:

  • Google PageSpeed Insights
  • webpagetest.org
  • YSlow
  • GTmetrix

reveals a common theme: budweiser.com could do with some serious attention.

Conclusion

This fast paced SEO audit has identified a variety of technical and on-site issues that are holding Budweiser back in the search results. No doubt if a more structured and rigorous audit were completed, the list would be even longer, but the above items certainly represent an excellent starting point to improve the search engine rankings for budweiser.com.

Why Doesn’t Google Ignore Manipulative Links?

Recently Dan Petrovic published an excellent article titled The Great Link Paradox which discusses the changing behaviour of website owners regarding how they link from one page to another, the role Google is playing in this change and their fear mongering.

The key takeaway from the article is a call to action for Google:

At a risk of sounding like a broken record, I’m going to say it again, Google needs to abandon link-based penalties and gain enough confidence in its algorithms to simply ignore links they think are manipulative. The whole fear-based campaign they’re going for doesn’t really go well with the cute brand Google tries to maintain.

I’d like to talk about the following phrase in the above quote, “simply ignore links they think are manipulative”, but before that, let’s take a step back.

Fighting Web Spam

As Google crawls and indexes the internet, ingesting over 20 billion URLs daily, their systems identify and automatically take action on what they consider to be spam. In the How Google Search Works micro-site, Google published an interesting page about their spam fighting capabilities. Set out in the article are the different types of spam that they detect and take action on:

  1. cloaking and/or sneaky redirects
  2. hacked sites
  3. hidden text and/or keyword stuffing
  4. parked domains
  5. pure spam
  6. spammy free hosts and dynamic DNS providers
  7. thin content with little or no added value
  8. unnatural links from a site
  9. unnatural links to a site
  10. user generated spam

In addition to the above, Google also use humans as part of their spam fighting efforts, through manual website reviews. If a website receives a manual review by a Google employee and is deemed to have violated their guidelines, a manual penalty can be applied to the site.

Manual penalties come in two forms, site-wide or partial. The former obviously affects an entire website, with all pages subject to the penalty, while the latter might affect a sub-folder or an individual page within the website.

What the actual impact of the penalty is varies as well: some penalties might see an individual page drop in rankings, an entire folder drop in rankings or an entire site drop in rankings; the penalty might only affect non-brand keywords or all keywords; or it might cause Google to ignore a subset of inbound links to a site – there is a litany of options at Google’s disposal.

Devaluing Links

Now that Google can detect unnatural links from and to a given website, the next part of the problem is being able to devalue those links.

Google has had the capacity to devalue links and has been able to do so since at least January 2005 when they announced the rel=”nofollow” attribute in an attempt to curb comment spam.

For the uninitiated, normally when Google discovers a link from one page to another, they will calculate how much PageRank or equity should flow through that specific link to the linked URL. If a link has a rel=”nofollow” attribute applied, Google will completely ignore the link; as such, no PageRank or equity flows to the linked URL and it does not impact organic search engine rankings.

In addition to algorithmically devaluing links explicitly marked with rel=”nofollow”, Google can devalue links via a manual action. For websites that have received a manual review, if Google doesn’t feel confident that the unnatural inbound links are deliberate or designed to manipulate the organic search results, they may devalue those inbound links without penalising the entire site.


Google Webmaster Tools: Manual Action – Partial Match For Unnatural Inbound Links

Why Aren’t Google Simply Ignoring Manipulative Links?

At this stage, it seems all of the ingredients exist:

  • can detect unnatural links from a site
  • can detect unnatural links to a site
  • can devalue links algorithmically via rel=”nofollow”
  • can devalue links via a manual penalty

With all of this technological capability, why don’t Google simply ignore or reduce the effect of any links that they deem to be manipulative? Why go to the effort of orchestrating a scare campaign around the impact of good, bad or indifferent links? Why scare webmasters half to death about linking out to relevant, third party websites such that their readers are disadvantaged because relevant links aren’t forthcoming?

The two obvious reasons that immediately come to mind:

  1. Google can’t identify unnatural links with enough accuracy
  2. Google doesn’t want to

Point 1 above doesn’t seem a likely candidate: the Google Penguin algorithm, which rolled out in April 2012, was designed to target link profiles that were low quality, irrelevant or had over optimised link anchor text. If they are prepared to penalise a website via Google Penguin, it seems reasonable to assume that they have confidence in identifying unnatural links and taking action on them.

Point 2 at this stage remains the likely candidate: Google simply don’t want to flatly ignore all links that they determine are unnatural, whether those links exist through poorly configured advertising, black hat link building tactics, over enthusiastic link building strategies or simply bad judgement.

What would happen if Google did simply ignore manipulative inbound links? Google would only count links that their algorithms determined were editorially earned. Search quality wouldn’t change; Google Penguin is designed to clean house periodically through algorithmic penalties, and if Google simply ignored the very same links that Penguin targets, those same websites wouldn’t have risen to the top of the rankings only to get cut down by Penguin at a later date.

Google organic search results are meant to put forward the most relevant, best websites to meet a user’s query. Automatically ignoring manipulative links doesn’t change the search result quality, but it also doesn’t provide any deterrent for spammers. With irrelevant links simply being ignored, a spammer is free to push their spam efforts into overdrive without consequence, and Google want there to be a consequence for deliberately violating their guidelines.

How effective was the 2005 addition of rel=”nofollow” in fighting comment spam by removing the reward of improved search rankings? It had no impact. Spam levels recorded by Akismet, the spam detection service by Automattic, haven’t eased since it launched – in fact comment spam levels are growing at an alarming rate despite the fact that most WordPress blogs have rel=”nofollow” comment links. The parallels between removing the reward for comment spam and removing the reward of spammy links are striking; it didn’t work last time, so why would Google expect it to work this time?

Why aren’t Google ignoring unnatural links automatically?

Google doesn’t like being manipulated, period.

Unbeatable SEO Tips – Online Retailer Roadshow

At the beginning of September, Reed Exhibitions contacted me about a public speaking opportunity at Online Retailer Roadshow in Brisbane which came about via Dan Petrovic of Dejan SEO.

The one day conference went off without a hitch and Reed Exhibitions should be commended on finding such a great list of speakers for the event. I really enjoyed the presentations from Faye, Steve and Josh – they are all doing some really interesting stuff.

The suggested topic for my presentation was great SEO tips for 2014 that’d genuinely help the attendees run a better online business and hopefully make more money online.

Given that the conference was specifically about online retailing, I focused a lot of my tips and suggestions on taking care of common issues with ecommerce product websites, website performance so users don’t have to wait for the site to load, producing great content that outpaces your competitors, link building, and a couple of suggestions around leveraging rich snippets and advanced social sharing implementations such as Twitter Cards.

This was my first public speaking engagement with an audience of over 100 other professionals. I was quite nervous listening to the other presenters and watching the clock tick down to my start time, but once I got onto the stage that disappeared and I think I did quite well. I learned two things from the experience: speaking to a time allocation is an art (I went over my 30 minute limit), and there is an impressive amount of time and effort required to produce the incredible slide decks you see on SlideShare – a skill I am currently left wanting in, but one I will definitely improve upon for my next presentation.

Using Hashed Keywords Instead Of (not provided) Keyword

Google started providing encrypted search back in 2010 and, while the connection between the user and Google was encrypted, Google were still passing the user’s search query through to websites. In October 2011, Google made a change whereby users logged into their Google Account on google.com would be automatically switched over to HTTPS, and in March 2012 Google announced that they were rolling that same change out globally through all of their regional Google portals such as www.google.com.au.

Importantly, unlike the encrypted search product from Google released in 2010 that still passed the user’s search query through to the destination website, Google are not passing the user’s search query through to websites as of the changes rolled out in 2011 and subsequently in 2012.

(not provided) Keyword


(not provided) Keyword Growth

The lack of the keyword information being passed through to the destination website manifests itself in web statistics products like Google Analytics with a pseudo-search term known as (not provided).

To provide a high level example of what is happening, if a website received 5000 visits from 5000 different users, each with unique search phrases and all users were using Google secure search – a product like Google Analytics will report all of those 5000 visits against a single (not provided) keyword and aggregate all of the individual user metrics against that one keyword.

In more specific terms, below are some of the issues faced not having search query data:

  • you won’t know how many unique search queries and their respective volumes are entering a site
  • you can’t analyse keyword level metrics like pages/visit, bounce rate, conversion rate
  • you can’t find pages competing with one another inside a site and providing a poor user experience
  • you can’t optimise a landing page based on the users keyword
  • you won’t be able to understand user search behaviour in terms of their research/buy cycle
  • you’ll lose the ability to understand how your brand, product and generic phrases are related to one another
  • you’ll lose the ability to understand how different devices play a role in your marketing efforts and how the research/buy cycle differs between them
  • you can’t report on goal completions or goal funnel completion by keyword
  • you can’t report on transactions, average order value or revenue by keyword
  • attribution for a major percentage of a site’s traffic is greatly impacted

Hashed Keywords


Example Google Analytics Organic Keyword Report Using SHA-1 Hashing Function

I wondered long ago if Google might consider taking a small step back from their current stance: instead of sending no value for the query through to the destination website in the HTTP REFERER header, they might provide a unique hash for every keyword.

For those unaware, hashing algorithms take variable length inputs and produce an associated, unique, fixed length output. There are a variety of different hashing functions available, but as an example of their use, SHA-1 is used in cryptography and forms part of the security for HTTPS web traffic.
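
A quick sketch shows the idea: the same query always produces the same fixed length digest, so metrics could still be aggregated per keyword without the keyword itself being exposed. The example queries are invented, and in practice the hashing would happen on Google’s side before the referral ever reached the website.

import hashlib

# Invented example queries; each one maps to a stable 40 character digest.
for query in ("best running shoes", "running shoe reviews"):
    digest = hashlib.sha1(query.encode("utf-8")).hexdigest()
    print(query, "->", digest)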

The important thing to understand about this idea, whether it is done through a hashing function or another mechanism, is that the goal would be to replace the user’s actual query with another unique value that doesn’t disclose or leak the user’s actual query, for privacy reasons.

Using an approach like this isn’t going to address all of the issues raised in the bullet point list above, or the longer list of issues the (not provided) keyword introduces, however it improves a business’s understanding of their website and their visitors’ behaviour without compromising a user’s right to privacy.

Unintended Side Effects

History shows that as we make advances in one area, often with only the best of intentions, those best intentions are ultimately twisted, bent and adapted to drive some less than ideal outcomes.

The same can be seen with user privacy: the HTTP REFERER header was designed to help a website owner understand how users move through the internet at large and through an individual website. When the HTTP specification was first developed, I’m sure the inventors didn’t imagine that that simple concept would ultimately become a tool to attack a user’s privacy.

Now the question to ask would be, if Google were to take a couple of steps back from where they are currently and provide a hashed representation of the user’s query instead of no query data at all – could a website owner, opportunistic marketer or nefarious hacker misuse the hashed query against the user in some way? Could the hashed keyword value be reverse engineered to ascertain what the original user’s query was?

Is there hope for the future?

Visualising Googlebot Crawl With Excel

For most websites, search engines and more specifically Google represent a critical part of their traffic breakdown. It is commonplace to see Google delivering anywhere from 25% to over 80% of the traffic to different sized sites in many different verticals.

Matt Cutts was recently asked what the most common SEO mistakes were and he led off the list with the crawlability of a website. If Google can’t crawl through a website, it prohibits Google from indexing the content and will therefore have a serious impact on the discoverability of that content within Google search.

With the above in mind, it is important to understand how search engines crawl through a website. While it is possible to scan through log files manually, it isn’t very practical and it doesn’t provide an easy way to discover sections of your site that aren’t being crawled or are being crawled too heavily (spider traps) and this is where a heat map of crawl activity is useful:

Visualising Googlebot Crawl Activity With Excel & Conditional Formatting

In this article, we’ll briefly cover Microsoft Log Parser, gaining access to and organising your log files, identifying Googlebot crawl activity and visualising that activity in Microsoft Excel.

Microsoft Log Parser

Microsoft Log Parser is an old, little known, general purpose utility for analysing a variety of log style formats, which Microsoft describe as:

Log parser is a powerful, versatile tool that provides universal query access to text-based data such as log files, XML files and CSV files, as well as key data sources on the Windows® operating system such as the Event Log, the Registry, the file system, and Active Directory®.

You tell Log Parser what information you need and how you want it processed. The results of your query can be custom-formatted in text based output, or they can be persisted to more specialty targets like SQL, SYSLOG, or a chart. The world is your database with Log Parser.

The latest version of Log Parser, version 2.2, was last released back in 2005 and is available as a 1.4MB MSI from the Microsoft Download Centre. Operating system compatibility is stated as Windows 2000, Windows XP Professional Edition & Windows Server 2003, but I run it on Windows 7, which suggests to me that it’ll probably run on Windows Vista and maybe even Windows 8.

In case you missed the really important point above that makes Microsoft Log Parser a great little utility, it allows you to run SQL like statements against your log files. A simple and familiar exercise might be to find broken links within your own website or to find 404 errors from broken inbound links.
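
As a hedged example of that exercise, a query along the following lines would list the URLs returning a 404 along with the page that referred the visitor (the sc-status and cs(Referer) field names assume the same W3C log format used in the queries later in this article; check your own log headings before running it):

“c:\program files\log parser 2.2\LogParser.exe” -e:5 -i:W3C “SELECT cs-uri-stem, cs(Referer), COUNT(*) INTO 404s.csv FROM ex* WHERE sc-status = 404 GROUP BY cs-uri-stem, cs(Referer) ORDER BY cs-uri-stem”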

Gaining Access To Your Log Files

Depending on the type of website you’re running and what environment you run it in, getting access to your log files can be the single biggest hurdle in this endeavour, but you just need to be patient and persevere.

If you have your own web hosting, it is likely that you’ll have access to your server log files via your web hosting control panel software such as cPanel or Plesk. That doesn’t necessarily mean that your hosting has been configured to actually log website access, as a lot of people turn it off to save a little disk space.

If your hosting doesn’t have logging enabled currently, first port of call is configuring that as it is obviously a prerequisite to visualising Googlebot crawl activity through your websites. Once configured, depending on the size of your website and how important it is in Google’s eyes – you may need to wait 4-6 weeks to get sufficient data to understand how Googlebot is accessing your website.

Corporate websites will invariably have web traffic logging enabled as it is helpful for debugging and compliance reasons. Getting access to the log files might require an email or two to your IT department or maybe a phone call to a senior system administrator. You’ll need to explain to them why you want access to the log files, as it will normally take some time for them to either organise security access for you to access that part of your corporate network or they may need to download/transfer them from your external web hosting to a convenient place for you to access them from.

Organising Your Log Files

To get the most out of this technique, you’ll want access to as many weeks or months of log files as possible. Once you download them from your web hosting provider or your IT department provides access to the log files, place them all in the same directory for log analysis by Microsoft Log Parser.


Directory Showing Daily Web Server Logs Broken Into 100MB Incremental Files

As you can see in the image above, the server I was working with generates log files with a consistent naming convention per day and produces a new incremental file for every 100MB of access logs. Your web server will probably generate a different sequence of daily, weekly or monthly log files, but you should be able to put all of them into a directory without any hassle.

Microsoft Log Parser Primer

Log Parser by Microsoft is a command line utility which accepts arguments from the command prompt to instruct it how to perform the log analysis. In the examples below, I’ve passed three arguments to Log Parser: e, i and the query itself, but you can provide as many as you need to get the desired output.

Within the query itself, the columns you’d SELECT are the column headings out of the log file, so in my example below I have a column heading named cs-uri-stem representing the URL without the domain information. Open one of your log files in a text editor and review the headings in the first row of the log file to find out which column headings to use within your SELECT statement.

Just like a SQL query in a relational database, you need to specify what to select from, which under normal circumstances would be a SQL database table. Log Parser maintains that same idiom, except you can select from an individual log file, where you’d provide the file name, or from a group of log files identified by a pattern. In the examples I’ve used below, you can see that the FROM statement has ex*, which matches the pattern in the Organising Your Log Files section above.

As you’d expect, Log Parser provides a way to restrict the set of log records to analyse with a WHERE statement, and it works exactly the same way it does in a traditional SQL database. You can join multiple conditions together with AND or OR statements and use brackets to control precedence.

Conveniently, Microsoft Log Parser also provides aggregate functions like COUNT, MAX, MIN, AVG and many more. This in turn suggests that Log Parser also supports other related aggregate functionality like GROUP BY and HAVING, which it does, in addition to ORDER BY and a raft of other more complex functionality.

Importantly for larger log analysis, Log Parser also supports storing the output of the analysis somewhere, which can be achieved by using the INTO keyword after the SELECT statement as you can see in the examples below. If you use the INTO keyword, whatever the SELECT statement outputs will be stored in the file specified, whether it is a single value or a multi-column, multi-row table of data.

Microsoft provide a Windows help document with Log Parser, which is located in the installation directory and provides a lot of help about the various options and how to combine them to get the output that you need.

Now that the super brief Log Parser primer is over and done with, time to charge forward.

Identifying Googlebot Crawl Activity

While Microsoft Log Parser is an incredible utility, it has a limitation that a normal SQL database doesn’t – it does not support joining two or more tables or queries together on a common value. That means to get the data we need to perform a Googlebot crawl analysis, we’ll need to perform two queries and merge them in Microsoft Excel using a simple VLOOKUP.

Some background context so the Log Parser queries below make sense: the website the log files are from uses a human friendly URL structure, with descriptive words in the URLs in a directory like structure ending with a forward slash. While it doesn’t happen a lot on this site, I’m lower casing the URLs to consolidate crawl activity into fewer URLs and get a better sense of Googlebot’s activity as it crawls through the site. Similarly, I am deliberately ignoring query string arguments for this particular piece of analysis to consolidate crawl activity into fewer URLs. If there is a lot of crawl activity around a group of simplified URLs, it’ll show up in the visualisation and be easier to query for the specifics later.

Next up, the queries themselves. Open a Command Prompt by going to START->RUN and entering cmd, then change directory to where you’ve stored all of your log files. Microsoft Log Parser is installed in the default location on my machine, but change that accordingly if needed.

Query: Find all URLs

“c:\program files\log parser 2.2\LogParser.exe” -e:5 -i:W3C “SELECT TO_LOWERCASE(cs-uri-stem), date, count(*) INTO URLs.csv FROM ex* WHERE cs-uri-stem like ‘%/’ GROUP BY TO_LOWERCASE(cs-uri-stem), date ORDER BY TO_LOWERCASE(cs-uri-stem)”

Query: Find all URLs that Googlebot accessed

“c:\program files\log parser 2.2\LogParser.exe” -e:5 -i:W3C “SELECT TO_LOWERCASE(cs-uri-stem), date, count(*) INTO googlebot.csv FROM ex* WHERE cs-uri-stem like ‘%/’ AND cs(User-Agent) LIKE ‘%googlebot%’ GROUP BY TO_LOWERCASE(cs-uri-stem), date ORDER BY TO_LOWERCASE(cs-uri-stem)”

You can be as specific with the user agent string as you like; I’ve been very broad above. If you felt it necessary, you could filter out fake Googlebot traffic by performing a reverse DNS lookup on the IP address to verify it is a legitimate Googlebot crawler, per the recommendation from Google.
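
For what it’s worth, the verification Google describe can be scripted. Below is a rough sketch, not a robust implementation (no caching, no handling of slow DNS servers), and the example IP address is just a placeholder for one pulled from your own logs.

import socket

def is_real_googlebot(ip_address):
    # Reverse DNS lookup of the IP address taken from the log file.
    try:
        host, _, _ = socket.gethostbyaddr(ip_address)
    except socket.herror:
        return False
    # Genuine Googlebot hosts sit under googlebot.com or google.com.
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    # Forward confirm that the host name resolves back to the same IP.
    try:
        forward_ips = socket.gethostbyname_ex(host)[2]
    except socket.gaierror:
        return False
    return ip_address in forward_ips

print(is_real_googlebot("66.249.66.1"))  # placeholder IP from a log entry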

Microsoft Excel

Open both the CSV files output from the queries above. Add a new Excel Worksheet named “Googlebot” to the URLs.csv file and paste into it the contents of googlebot.csv. This will allow you to merge the two queries easily into a single sheet of data that you can generate the visualisation from.

VLOOKUP

Since the queries above result in more than one line per URL for each day they were accessed, a new column needs to be added to work as a primary key for the VLOOKUP. Insert a new column at column A and title it “PK” in both worksheets. In cell A2 in both worksheets, add the following function and copy it down for all rows in both worksheets:

=CONCATENATE(B2, C2)

The CONCATENATE function will join two strings together in Excel. In our instance we want to join together the URL and the date it was accessed, so that the VLOOKUP function can access the correct Googlebot daily crawl value.

Sort both Excel worksheets by the newly created PK column A->Z. Make sure this step is carried out, as a VLOOKUP function doesn’t work as you expect if the tables of data you’re looking up data from aren’t sorted.

Add a new column named Googlebot to your URLs worksheet and in the first cell we’re going to add a VLOOKUP function to fetch the number of times a given URL was crawled by Googlebot on a given date from the Googlebot worksheet:

=IFERROR(VLOOKUP(A2,Googlebot!$A$2:$D$8281, 4), 0)

The outer IFERROR says if there is an error with the VLOOKUP function, return a 0. This is helpful since not all URLs within the URLs worksheet have been accessed by Googlebot. The inner VLOOKUP function looks up the value for A2, the URL & date value you added earlier in the first column from the rows and columns of the Googlebot worksheet minus the column headings. If you’re not familiar with the $ characters in between the Excel cell references, they cause the range to remain static when the function is copied down the worksheet.

Visualising Googlebot Crawl Data Ready

The image above shows, left to right: the URL with the numeric date appended, the actual URL, the date the URL was crawled, the number of times Googlebot crawled the URL and the total number of times the URL was accessed.

PivotTable

Microsoft Excel provides a piece of functionality named PivotTable, which essentially allows you to rotate or pivot your spreadsheet of information around a different point and perform actions on the pivoted information such as aggregate functions like sum, max, min or average.

In our example, we don’t need to perform calculations on the data – that was performed by Log Parser. Instead, the pivot table is going to take the date column from the URLs worksheet, which has a unique date for each day within your log files, and transform it by making each unique date a new column. For example, if you were analysing 30 days of crawl information, you’ll go from one column containing all 30 dates to having 30 columns representing each date.
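
If you would rather script this reshaping step than drive it through Excel, a rough pandas sketch of the same pivot is below; the file name and column names are placeholders for whatever your merged worksheet actually contains.

import pandas as pd

# Assumes the merged data has been exported with columns named
# url, date and googlebot_crawls; adjust to match your own export.
merged = pd.read_csv("merged.csv")

heatmap_data = merged.pivot_table(
    index="url",              # one row per URL
    columns="date",           # one column per day in the logs
    values="googlebot_crawls",
    aggfunc="sum",
    fill_value=0)

heatmap_data.to_csv("googlebot-heatmap.csv")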

Within the URLs worksheet, select the columns representing the URL, date and the number of times Googlebot crawled the URLs. Next click Insert in the Excel navigation and select PivotTable (the left most icon within the ribbon navigation in recent versions of Microsoft Office). Once selected, Excel will automatically select all rows and columns that you highlighted in the worksheet and pressing Ok will create a new worksheet with the pivot table in it, ready for action.

Within the PivotTable Field List in the right column, place a check in each of the three columns of information imported into the pivot table. In the bottom of the right column, drag the fields around so that the date field is in the Column Labels section, URL is within the Row Labels section and the Googlebot crawl is within the Sum Values section. Initially Excel will default to using a count aggregate function, but needs to be updated to SUM by clicking the small down arrow to the right of the item, selecting Value Field Settings and picking SUM from the list.

Visualising Googlebot Crawl Excel Pivot Table Options

Visualisation

Now that the data has been prepared using the PivotTable functionality within Excel, we’re able to apply visual cues to the data to make it easier to understand what is happening. To solve that problem quickly and easily, we’re going to use Conditional Formatting, which allows you to apply different visual cues to data based on the data itself.

Select the rows and columns that represent the daily crawl activity; don’t include the headings or the total column or it’ll skew the visualisation due to the large numbers in those columns. Once selected, click the Home primary navigation item and then Conditional Formatting, expand out Colour Scales and choose one you like. I chose the second item in the first row; as such, URLs with lots of crawl activity will appear red or hot.

Tip
To increase the density of the visualisation, in case you’ve chosen to visualise a large date range, select the columns that represent the dates, right click and choose Format Cells, then go into the Alignment tab and set the text direction to 90 degrees or vertical.

Use the zoom functionality in the bottom right corner of Excel to zoom out if necessary, and what you’re left with is a heat map showing Googlebot crawl activity throughout the different URLs within your website over time.

Visualising Googlebot Crawl Activity With Excel & Conditional Formatting

Without a mechanism to visualise the crawl rate of Googlebot, it would be impossible to understand why the three URLs in the middle of the image were repeatedly crawled by Googlebot. Could this have been a surge in links off the back of a press release? Maybe there was press coverage that didn’t link, which would represent a fast, easily identifiable link building opportunity.

It is now dead simple to see which sections of your website aren’t getting crawled very often, which sections are getting crawled an appropriate amount and which sections could be needlessly burning up Googlebot crawl resources that could be spent crawling useful content in other sections of the website.

Go forth and plunder your web server logs!

Does Fast Web Hosting Improve Search Engine Rankings?

Everyone loves fast websites; it doesn’t matter if you’re a casual internet user or a power user – website performance matters. Research by Google, Microsoft, Yahoo! and many others has shown that fast websites have better user metrics, whether that be converting users into customers, performing more searches, higher usage patterns, increased order values and more. In addition to providing benefits across the board, Google announced in April 2010 that website performance was now a ranking factor, such that slow loading websites could be ranked lower in the search results for a given query or, conversely, that fast websites could be ranked higher.

A Brief History

I live in Queensland, Australia; specifically on the sunny Gold Coast.

Since I started my personal blog back in 2004, I’ve always wanted my hosting to be fast for me personally. My desire for speed wasn’t born out of some crackpot web performance mantra, it was far simpler than that: I hate slow email. Having my web hosting in Australia guaranteed that my personal email was going to be lightning fast and, as a happy by-product, my website screamed along as well.

In November 2012 I decided that I’d attempt to simplify my life, go hunting for a new web host that provided some additional features I was looking for and consolidate some of my hosting accounts. I still needed the bread and butter PHP and MySQL for WordPress, but wanted more options for some smaller projects I planned to dabble with this year, so I added the following to the mix:

  • PostgreSQL
  • Python
  • Django
  • Ruby
  • Rails
  • SSH access
  • Large storage limits
  • No limits on the number of websites I could host
  • Reasonably priced

I begrudgingly accepted the fact that I’d probably end up using a United States web host and had to build a bridge over my soon-to-be 200ms+ ping. After a moderate amount of research, I signed up with Webfaction, who tick all of the above boxes.

Crawling

Google is a global business with local websites such as www.google.com.au or www.google.de in most countries around the world. They service their massive online footprint by having dozens of data centers spread around the world. Users from different parts of the world tend to access Google services via their nearest data center to improve performance.

What a lot of people don’t realise is that, despite Google having data centers around the world, Google’s web crawler, named Googlebot, crawls the internet only from the data centers located in the United States of America.

For websites hosted in the far reaches of the world, this means it takes more time to crawl a web page compared to a web page hosted in the US, due to the vast distances, network switching delays and the obvious limit of the speed of light through fibre optic cables. To illustrate that a little more clearly, the table below shows ping times from around the world to my Webfaction hosting located in Dallas, TX:

Dallas, U.S.A. 3ms
Florida, U.S.A. 32ms
Montreal, Canada 53ms
Chicago, U.S.A. 58ms
Amsterdam, Netherlands 114ms
London, United Kingdom 117ms
Groningen, Netherlands 133ms
Belgrade, Serbia 150ms
Cairo, Egypt 165ms
Sydney, Australia 195ms
Athens, Greece 199ms
Bangkok, Thailand 257ms
Hangzhou, China 267ms
New Delhi, India 318ms
Hong Kong, China 361ms

Based on the table above, it is clear that moving to US web hosting was going to have a significant impact on my load time as far as Googlebot was concerned. 
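
As a rough illustration of why that distance matters, fetching a page involves several network round trips (the TCP handshake, the HTTP request and response, and often a few more for TCP slow start), so total fetch time is dominated by the round-trip time. The sketch below uses assumed round-trip counts and server processing time purely to show the scale of the difference; none of these figures are measured values from my hosting or from Googlebot.

```python
# Rough, illustrative estimate only: round_trips and server_time_ms are
# assumptions chosen to show how RTT dominates fetch time, not measurements.
def estimated_fetch_ms(rtt_ms, round_trips=4, server_time_ms=150):
    return round_trips * rtt_ms + server_time_ms

for location, rtt_ms in [("US hosting (Dallas)", 3), ("Australian hosting (Sydney)", 195)]:
    print(f"{location}: ~{estimated_fetch_ms(rtt_ms)}ms per page fetch")
```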

While I was still using my Australian web hosting, Google Webmaster Tools was reporting an average crawl time of around 900-1000ms, which you can see reflected on the left of the graph below; after the move to Webfaction it now averages around 300ms.

Google Webmaster Tools Crawl Stats Time Spent Downloading - www.lattimore.id.au

The next two graphs, also taken from the Crawl Stats section of Google Webmaster Tools, show the number of pages crawled per day and the number of kilobytes downloaded per day respectively.

Google Webmaster Tools Crawl Stats Pages Crawled - www.lattimore.id.au

Google Webmaster Tools Crawl Stats Kilobytes Downloaded - www.lattimore.id.au

Google crawl the internet in descending PageRank order, which means the most important websites are crawled the most regularly and most comprehensively, while the least important are crawled less frequently and less comprehensively. Google prioritise their web crawl efforts because, although they have enormous amounts of computing power, it is still a finite resource that they need to consume judiciously.

By assigning a crawl budget to each website based on its perceived importance, Google has a mechanism to spend an appropriate amount of resource crawling different sites. Most importantly for Google, it limits the amount of time they will spend crawling a less important website with a very large volume of content, which might otherwise consume more than its fair share of resources. In simple terms, Google is willing to spend a fixed amount of resource crawling a website, such that they could crawl 100 slow pages or 200 pages that loaded in half the time.
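
To make that last example concrete, here is a minimal sketch of the fixed budget idea; the daily time budget is a made-up number used only to show that halving page load time roughly doubles the number of pages that fit inside it.

```python
# Hypothetical daily time budget Googlebot spends crawling one site, in seconds.
crawl_budget_seconds = 100

# Halving the load time per page roughly doubles the pages crawled within the budget.
for load_time_seconds in (1.0, 0.5):
    pages = int(crawl_budget_seconds / load_time_seconds)
    print(f"{load_time_seconds}s per page -> ~{pages} pages crawled")
```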

It’s been reported several times before that improving the load time of a website will have an impact on the number of pages that Google will crawl per day. Interestingly, that isn’t reflected in the graphs above at all. I’d speculate that since my personal blog has a PageRank 4 home page, it has ample crawl budget for the approximately 625 blog posts I’ve published over the years. While reducing the load time of a site would normally help, it didn’t have an impact on my personal blog because the site isn’t suffering from crawling and indexing problems.

Indexing

Google Webmaster Tools Advanced Index Status - www.lattimore.id.au

Google announced in July 2012 a new report in Google Webmaster Tools named Index Status. The default view in the report shows the number of indexed URLs over time, however the Advanced tab shows additional detail that can highlight serious issues with a website’s health at a glance, such as the number of URLs blocked by robots.txt.

One particular attribute of the Advanced report is a time series of URLs named Not Selected. These are URLs that Google knows about on a given website but that aren’t being returned in the search results because they are substantially similar to other URLs (duplicate content), or have been redirected or canonicalised to another URL.

The graph above shows a huge reduction in the number of Not Selected URLs on 11/11/2012, which was the week that I changed my hosting to Webfaction. It should be noted that I didn’t change anything else when I moved my hosting; it was a simple backup and restore operation. That week Google Webmaster Tools reported nearly 1000 URLs were removed from Not Selected.

I’m a little confused as to what to make of that, since the Total Indexed line in blue above didn’t increase at all. Is it possible that Google had URLs counted in Total Indexed which weren’t being returned for other reasons, and that improving the performance of my website meant they no longer appear in Not Selected?

While I’m grateful that Google Webmaster Tools now provides the Index Status report, as it does provide some insight into how Google views a website, it could contain far more actionable data. For example, to help with debugging it would be great if you could download a list of the URLs that Google is blocked from crawling via robots.txt, or the list of URLs sitting within Not Selected.

Ranking

Below is a graph from the Search Queries report in Google Webmaster Tools from the start of November 2012. I changed my hosting on 4 November and, as mentioned in the Indexing section above, there was a significant change in the number of Not Selected URLs on 11 November.

It is subtle, but the blue line in the graph below indicates a small uplift in impressions from around 19 November. Unfortunately I don’t have a screenshot showing the trend from before the start of November, however the report consistently showed the same curves you can see below prior to that small increase in impressions.

Google Webmaster Tools Search Queries - www.lattimore.id.au

My personal blog doesn’t take a lot of traffic at around 4000-4500 visits per month, so it isn’t going to melt any servers, though I thought it’d be useful to see if any of the above translated through to Google Analytics. The graph below compares organic search referrals for the 30 days from 19 November, lining up with the lift in impressions above, against the prior 30 day window. Over that time span organic visits to my personal blog grew by around 9.5%, or about 400 visits, which is nothing to sneeze at. Comparatively, traffic over the same time span in each of the last three years fell by 6%-8%, highlighting that 9% growth is really quite good.

Google Analytics Organic Visits - www.lattimore.id.au

Opportunity

One small personal blog does not make a foregone conclusion, but it does begin to get the cogs turning about the opportunity. Businesses located in the USA are likely to use US web hosting naturally, so latency isn’t really an issue for them.

However for businesses located large distances from the US, an opportunity could exist to get a small boost by either switching to US hosting or subscribing to a content delivery network with US points of presence.

A word of caution: Google uses a variety of factors to determine which audience in the world a website’s content best serves. If your website uses a country specific domain such as a .com.au, that website will automatically be more relevant to users searching from within Australia. If, however, you use a generic top level domain such as a .com, Google uses a variety of signals, among them the location of the web hosting, to determine which audience around the world best fits that content, such that a .com on UK web hosting would be considered more relevant to United Kingdom users. If your business primarily targets a country such as Australia but uses a gTLD, Google Webmaster Tools provides a mechanism to apply geotargeting to the website. Applying geotargeting to a gTLD (or one of a small number of special case ccTLDs) causes Google to associate that website with the specified geographic area, as if it were using the corresponding country specific domain.