Why Doesn’t Google Ignore Manipulative Links?

Recently Dan Petrovic published an excellent article titled The Great Link Paradox, which discusses the changing behaviour of website owners in how they link from one page to another, the role Google is playing in that change and the fear mongering involved.

The key takeaway from the article is a call to action for Google:

At a risk of sounding like a broken record, I’m going to say it again, Google needs to abandon link-based penalties and gain enough confidence in its algorithms to simply ignore links they think are manipulative. The whole fear-based campaign they’re going for doesn’t really go well with the cute brand Google tries to maintain.

I’d like to talk about one phrase in the above quote, simply ignore links they think are manipulative, but before that, let’s take a step back.

Fighting Web Spam

As Google crawls and indexes the internet, ingesting over 20 billion URLs daily, their systems identify and automatically take action on what they consider to be spam. In the How Google Search Works micro-site, Google published an interesting page about their spam fighting capabilities. Set out in the article are the different types of spam that they detect and take action on:

  1. cloaking and/or sneaky redirects
  2. hacked sites
  3. hidden text and/or keyword stuffing
  4. parked domains
  5. pure spam
  6. spammy free hosts and dynamic DNS providers
  7. thin content with little or no added value
  8. unnatural links from a site
  9. unnatural links to a site
  10. user generated spam

In addition to the above, Google also use humans as part of their spam fighting toolkit through manual website reviews. If a website is reviewed by a Google employee and is deemed to have violated their guidelines, a manual penalty can be applied to the site.

Manual penalties come in two forms, site-wide or partial. The former obviously affects an entire website – all pages are subject to the penalty – while the latter might affect a sub-folder or an individual page within the website.

The actual impact of a penalty varies as well: an individual page might drop in rankings, an entire folder might drop, an entire site might drop, it might only affect non-brand keywords or all keywords, or it might cause Google to ignore a subset of inbound links to a site – there is a litany of options at Google’s disposal.

Devaluing Links

Now that Google can detect unnatural links from and to a given website, the next part of the problem is being able to devalue those links.

Google has been able to devalue links since at least January 2005, when they announced the rel=”nofollow” attribute in an attempt to curb comment spam.

For the uninitiated, normally when Google discovers a link from one page to another, they will calculate how much PageRank or equity should flow through that specific link to the linked URL. If a link has a rel=”nofollow” attribute applied, Google will completely ignore the link; no PageRank or equity flows to the linked URL and it does not impact organic search engine rankings.

In addition to algorithmically devaluing links explicitly marked with rel=”nofollow”, Google can devalue links via a manual action. For websites that have received a manual review, if Google doesn’t feel confident that the unnatural inbound links are deliberate or designed to manipulate the organic search results, they may devalue those inbound links without penalising the entire site.

Google Webmaster Tools: Manual Action – Partial Match For Unnatural Inbound Links

Why Aren’t Google Simply Ignoring Manipulative Links?

At this stage, it seems all of the ingredients exist:

  • can detect unnatural links from a site
  • can detect unnatural links to a site
  • can devalue links algorithmically via rel=”nofollow”
  • can devalue links via a manual penalty

With all of this technological capability, why don’t Google simply ignore or reduce the effect of any links that they deem to be manipulative? Why go to the effort of orchestrating a scare campaign around the impact of good, bad or indifferent links? Why scare webmasters half to death about linking out to relevant, third party websites such that their readers are disadvantaged because relevant links aren’t forthcoming?

The two obvious reasons that immediately come to mind:

  1. Google can’t identify unnatural links with enough accuracy
  2. Google doesn’t want to

Point 1 above doesn’t seem a likely candidate: the Google Penguin algorithm, which rolled out in April 2012, was designed to target link profiles that were low quality, irrelevant or had over-optimised link anchor text. If they are prepared to penalise a website via Google Penguin, it seems reasonable to assume that they have confidence in identifying unnatural links and taking action on them.

Point 2 at this stage remains the likely candidate: Google simply don’t want to flatly ignore all links that they determine are unnatural, whether they arise from poorly configured advertising, black hat link building tactics, over-enthusiastic link building strategies or simply bad judgement.

What would happen if Google did simply ignore manipulative inbound links? Google would only count links that their algorithms determined were editorially earned. Search quality wouldn’t change: Google Penguin is designed to clean house periodically through algorithmic penalties, and if Google simply ignored the very same links that Penguin targets, those websites wouldn’t have risen to the top of the rankings only to be cut down by Penguin at a later date.

Google organic search results are meant to put forward the most relevant, best websites to meet a user’s query. Automatically ignoring manipulative links doesn’t change the search result quality, but it also doesn’t provide any deterrent for spammers. With irrelevant links simply being ignored, a spammer is free to push their spam efforts into overdrive without consequence, and Google wants there to be a consequence for deliberately violating their guidelines.

How effective was the 2005 introduction of rel=”nofollow” at fighting comment spam by removing the reward of improved search rankings? It had no impact. Spam levels recorded by Akismet, the free spam detection service from Automattic, haven’t eased since it launched – in fact comment spam levels are growing at an alarming rate despite the fact that most WordPress blogs apply rel=”nofollow” to comment links. The parallel between removing the reward for comment spam and removing the reward for spammy links is striking; it didn’t work last time, so why would Google expect it to work this time?

Why aren’t Google ignoring unnatural links automatically?

Google doesn’t like being manipulated, period.

Buffer Automatic Author Social Profile Detection

I use Buffer and their various extensions and integrations while I’m browsing as a simple, easy way to share content. I love it, it’s a great tool in my opinion – highly recommend you give it a go if you haven’t already.

A little while ago, I shot the Buffer guys an email about a suggested feature, which they were good enough to respond to me on – awesome customer service. I thought I’d put it out in the public forum, maybe if it gets a little more attention it’ll make it off their ‘to do’ list onto the ‘must do’ list.

Before using Buffer, if I manually shared something I’d try and find the social handle/profile for the author and include them in the share so they’d get notified, such as the below if it was a tweet:

An excellent article about X by @author

One of the benefits of Buffer is that it allows me to share the content into multiple social networks in one go – awesome! However the shared snippet that goes into Twitter/LinkedIn defaults to the shared page’s <title> tag and you end up with a basic share containing a link and text description. For example, if you were to share my article about using hashed keywords instead of (not provided) keyword, it might look like this by default:

Using Hashed Keywords Instead Of (not provided) Keyword | Convergent Media http://buff.ly/S1MzQt

While you can customise the shared content, one of my primary reasons for using Buffer is about sharing it into multiple social networks and saving me time – as such I don’t spend a lot of time editing the default shared snippet. Typically I’d remove fluff, such as “| Convergent Media” in the above or rewrite it entirely if I didn’t feel it conveyed the message clearly enough but it’s typically a light touch approach.

As a result of my laziness (I’m just being honest), using Buffer has meant that I share into Twitter and LinkedIn (great) but what I share is a little less optimal than when I was only manually sharing into Twitter.

Suggested Feature To The Rescue

My feature suggestion for the Buffer guys was to look at the HTML and social widget configuration on a page being shared and see if they can determine the handle/profile of the author. If they can, do something smart; if they can’t, fall back to the default behaviour.

In an ideal world, if this was possible, when I share a piece of content Buffer would customise the share for each social network such that it’d use @author notation in Twitter and include the profile in LinkedIn, facebook and Google+ if sharing there ever arrives. Using this mechanism, the author of the content would be notified about the share in every relevant network that Buffer could identify them on.

This has a few benefits:

  1. It helps the person sharing the content, as the author is going to get notified of the share. Shares that don’t include a social profile for attribution often sadly go unnoticed by the author. Having the author aware of the share will help strike up relationships with other awesome folks – network building, thank you very much.
  2. The author is more likely to re-share my shared content with their followers, as they get to show their followers that other people love their product/service. A small amount of self horn tooting is okay in my book.
  3. It’s good for the person sharing the content as they get their name in front of the author’s followers. Just like in point one above, this exposure, especially if it is semi-regular, can lead to the author’s followers becoming your followers – superb.

Determining The Author’s Social Profiles

This isn’t going to work all the time; in many cases it might not be possible for Buffer to accurately determine the author’s social profiles for inclusion in the shared snippet, but I think it’d be worth a shot.

Following are some ways I can think of that Buffer could determine the author:

  • Check if the Twitter tweet button is on the page and whether the tweet text has been customised to something like “{postname} by @author”.
  • Following on from the above, they could go one step further (not sure if this would be as relevant): many websites also use the ‘related’ attribute in the Twitter share button to suggest to people sharing the content who to follow immediately after sharing.
  • Check if the author has an author profile on the site, in which case it is likely that they’ll have linked off to their various social media profiles.
  • Check if there is an author bio associated with the article, which might also include links to their various social profiles.
  • If Google+ sharing becomes an option in the future, Buffer could check for links to Google+ with authorship markup attached.
  • Crawl any social profiles you can find associated with the author (if none are for Twitter/LinkedIn/facebook directly) and see if those profiles list the author’s ‘other’ social profiles.

In the case of Twitter, there are official tweet text parsing libraries provided by Twitter in various languages, as well as many open source implementations of those libraries in other languages.
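To give a rough sense of what such parsing involves, here is a deliberately naive @mention extraction using a regular expression – a minimal sketch only, not the official twitter-text library, which handles far more edge cases:

```shell
# Naive @mention extraction from tweet text (sketch only; the official
# twitter-text parsing libraries handle edge cases like trailing
# punctuation, unicode handles and maximum handle length)
tweet='An excellent article about X by @author'
echo "$tweet" | grep -oE '@[A-Za-z0-9_]+'
```

In practice Buffer would want the battle-tested official libraries rather than a regex, but the underlying task is the same: pull candidate handles out of text and widget markup.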

So there you have it, a feature upgrade for Buffer that I think would be really worthwhile as it delivers a mutually beneficial gain to authors and social sharers alike.

Do you think it’s a good idea, should Buffer add it to their queue of work or am I mad?

Unbeatable SEO Tips – Online Retailer Roadshow

At the beginning of September, Reed Exhibitions contacted me about a public speaking opportunity at Online Retailer Roadshow in Brisbane which came about via Dan Petrovic of Dejan SEO.

The one day conference went off without a hitch and Reed Exhibitions should be commended on finding such a great list of speakers for the event. I really enjoyed the presentations from Faye, Steve and Josh – they are all doing some really interesting stuff.

The suggested topic for my presentation was great SEO tips for 2014 that’d genuinely help the attendees run a better online business and hopefully make more money online.

Given that the conference was specifically about online retailing, I focused a lot of my tips and suggestions on taking care of common issues with ecommerce product websites, website performance so users don’t have to wait for the site to load, producing great content that outpaces your competitors, link building, and a couple of suggestions around leveraging rich snippets and advanced social sharing implementations such as Twitter Cards.

This was my first public speaking engagement with an audience of over 100 professionals. I was quite nervous listening to the other presenters and watching the clock tick down to my start time, but once I got onto the stage that disappeared and I think I did quite well. I learned two things from the experience: speaking to a time allocation is an art – I went over my 30 minute limit – and there is an impressive amount of time and effort required to produce the incredible slide decks you see on SlideShare, for which I am left wanting but will definitely improve upon for my next presentation.

Using Hashed Keywords Instead Of (not provided) Keyword

Google started providing encrypted search back in 2010 and, while the connection between the user and Google was encrypted, Google were still passing the user’s search query through to websites. In October 2011, Google made a change whereby users logged into their Google Account on google.com would be automatically switched over to HTTPS and in March 2012, Google announced that they were rolling that same change out globally through all of their regional Google portals such as www.google.com.au.

Importantly, unlike the encrypted search product from Google released in 2010 that still passed the user’s search query through to the destination website, Google have not passed the user’s search query through to websites since the changes rolled out in 2011 and subsequently in 2012.

(not provided) Keyword

(not provided) Keyword Growth

The lack of the keyword information being passed through to the destination website manifests itself in web statistics products like Google Analytics with a pseudo-search term known as (not provided).

To provide a high level example of what is happening, if a website received 5000 visits from 5000 different users, each with unique search phrases and all users were using Google secure search – a product like Google Analytics will report all of those 5000 visits against a single (not provided) keyword and aggregate all of the individual user metrics against that one keyword.

In more specific terms, below are some of the issues faced in the absence of search query data:

  • you won’t know how many unique search queries are bringing users to a site, or their respective volumes
  • you can’t analyse keyword level metrics like pages/visit, bounce rate or conversion rate
  • you can’t find pages competing with one another inside a site and providing a poor user experience
  • you can’t optimise a landing page based on the user’s keyword
  • you won’t be able to understand user search behaviour in terms of their research/buy cycle
  • you’ll lose the ability to understand how your brand, product and generic phrases relate to one another
  • you’ll lose the ability to understand the role different devices play in your marketing efforts and how the research/buy cycle differs between them
  • you can’t report on goal completions or goal funnel completions by keyword
  • you can’t report on transactions, average order value or revenue by keyword
  • attribution for a major percentage of a site’s traffic is greatly impacted

Hashed Keywords

Example Google Analytics Organic Keyword Report Using SHA-1 Hashing Function

I wondered long ago if Google might consider taking a small step back from their current stance: instead of sending no value for the query through to the destination website in the HTTP REFERER header, they could provide a unique hash for every keyword.

For those unaware, hashing algorithms take a variable length input and produce an associated fixed length output that is, for practical purposes, unique to that input. There are a variety of different hashing functions available but, as an example of their use, SHA-1 is used in cryptography and is part of the security for HTTPS web traffic.

The important thing to understand about this idea, whether it is done through a hashing function or another mechanism, is that the goal would be to replace the user’s actual query with another unique value that doesn’t disclose or leak the user’s actual query, for privacy reasons.

Using an approach like this isn’t going to address all of the issues raised in the bullet point list above, or the longer list of issues the (not provided) keyword introduces, however it improves a business’s understanding of their website and their visitors’ behaviour without compromising a user’s right to privacy.
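To sketch what the scheme might look like (a hypothetical illustration of the idea, not anything Google has implemented), hashing two queries with SHA-1 on the command line produces a deterministic 40 character digest for each – identical queries always hash to the same value, so an analytics product could still aggregate visits per keyword, while the original query text can’t practically be recovered from the digest:

```shell
# Hash two example queries with SHA-1; the digest is deterministic, so
# a web statistics product could group visits by hashed keyword even
# though the actual query text is never disclosed to the website
printf 'buy red running shoes' | sha1sum | cut -d' ' -f1
printf 'cheap flights' | sha1sum | cut -d' ' -f1
```

Each line prints a different 40 character hexadecimal digest; re-running either command yields the same digest for the same query.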

Unintended Side Effects

History shows that as we make advances in one area, often with only the best of intentions, those best intentions are ultimately twisted, bent and adapted to drive some less than ideal outcomes.

The same can be seen with user privacy: the HTTP REFERER header was designed to help a website owner understand how users move through the internet at large and through an individual website. When the HTTP specification was first developed, I’m sure the inventors didn’t imagine that this simple concept would ultimately become a tool to attack a user’s privacy.

Now the question to ask would be: if Google were to take a couple of steps back from where they are currently and provide a hashed representation of the user’s query instead of no query data at all, could a website owner, opportunistic marketer or nefarious hacker misuse the hashed query against the user in some way? Could the hashed keyword value be reverse engineered to ascertain what the user’s original query was?

Is there hope for the future?

Social Media Isn’t Dead Yet

Recently Barry Adams wrote an article titled Social Media is Dead; Long Live SEO in which he puts forward the case that social media is a waste of time for most businesses and that they should focus on what works. It should be noted that it isn’t just Barry making up sensational headlines; those comments are supported by research conducted by Forrester in late 2012 and also by Custora in 2013.

I’m not here to dispute that directly, but I thought it was worth throwing another discussion point into the melting pot for everyone to consider: while the internet is a highly measurable place, which marketers and businesses alike love, it does have limitations and one of those limitations is uniquely identifying a person.

While the technology wasn’t as advanced, identifying an individual user 10 years ago was simpler – people had less frequent access to the internet and used fewer computers. Fast forward ten years and we are living in a multi-screen world, where an individual might switch between phone, tablet, laptop, desktop, TV and more across the course of the day, all the while continuing what they consider to be a single, unified experience.

All the different devices used by consumers today complicate the problem of uniquely identifying a person, as the unique identification is generally done through the use of browser cookies. That means that the same user viewing a website on their phone, tablet, laptop, desktop and TV are normally counted as separate users within web statistics software such as Google Analytics.

Barry replied to a comment on his article, mentioning that he has seen many different multi-channel attribution reports from Google Analytics that never register social media traffic sources in any significant way, even when looking at the assisted conversions report.

Google Analytics Multi-channel Funnel Assisted Conversions Report

Your mother would have told you never to believe everything you see on TV, read in the newspapers or view in Google Analytics – okay, I’ll concede the last point. What many don’t realise when seeing a headline from companies like Forrester, or a neat table like the one above, is that it is increasingly difficult to measure the impact of different traffic sources end to end due to the browser cookie issue I briefly mentioned above.

The Difficulty In Measuring Social Media Impact

For the sake of discussion, we’ll focus on facebook and Twitter as they are the most widely used social networks. It may or may not come as a surprise, but both of these social networks report that over 50% of their usage is via mobile devices.

Imagine that you’re Forrester and you’re trying to compile research about the impact of social media on businesses. When over 50% of the usage of the two biggest social networks in the world happens on mobile, and mobile conversion rates are well below their desktop counterparts, that alone provides a reason why it’s hard to directly measure the impact of social media.

Now consider the common scenario where a user returns to the same website they visited on their mobile via facebook, but this time on their computer via a brand query in Google search that ultimately leads to a conversion. It looks like search earned the conversion, and search did play a role, but so did facebook. However, because the journey spanned two different devices, multi-channel attribution within Google Analytics fails, even when looking at the assisted conversions.

Worthy Case Study Material

Recently Google announced a major upgrade to Google Analytics named Universal Analytics. One of the big changes with Universal Analytics is that you can provide a unique user identifier into the tracking and use the identifier across devices.

The case study I want to see from someone like Forrester is a collection of big businesses who implement Universal Analytics alongside a raft of user interface components throughout their sites designed to capture something unique about their users across all devices.

As an example, a user views your website after a referral from facebook on their mobile but doesn’t convert. The website could ask the user to sign up for an account with an incentive or to join an email database.

Now that you’ve got a unique identifier for the user, you’re in a position to track the impact of the facebook referral if the user happens to come back, either on the same device or a different device (tablet, laptop, computer, TV, ..), and purchases using the same unique identifier they provided on their mobile, such as their email address.

I don’t know if social media is thriving or dying as far as businesses are concerned but I know that we won’t have that answer until everyone gets a lot better at media attribution across the board.

Visualising Googlebot Crawl With Excel

For most websites, search engines, and more specifically Google, represent a critical part of their traffic breakdown. It is commonplace to see Google delivering anywhere from 25% to over 80% of the traffic for different sized sites in many different verticals.

Matt Cutts was recently asked what the most common SEO mistakes were and he led off the list with the crawlability of a website. If Google can’t crawl through a website, it prohibits Google from indexing the content and will therefore have a serious impact on the discoverability of that content within Google search.

With the above in mind, it is important to understand how search engines crawl through a website. While it is possible to scan through log files manually, it isn’t very practical and doesn’t provide an easy way to discover sections of your site that aren’t being crawled or are being crawled too heavily (spider traps), and this is where a heat map of crawl activity is useful:

Visualising Googlebot Crawl Activity With Excel & Conditional Formatting

In this article, we’ll briefly cover Microsoft Log Parser, gaining access to and organising your log files, identifying Googlebot crawl activity and visualising it in Excel.

Microsoft Log Parser

Microsoft Log Parser is an old, little known general purpose utility for analysing a variety of log style formats, which Microsoft describe as:

Log parser is a powerful, versatile tool that provides universal query access to text-based data such as log files, XML files and CSV files, as well as key data sources on the Windows® operating system such as the Event Log, the Registry, the file system, and Active Directory®.

You tell Log Parser what information you need and how you want it processed. The results of your query can be custom-formatted in text based output, or they can be persisted to more specialty targets like SQL, SYSLOG, or a chart. The world is your database with Log Parser.

The latest version of Log Parser, version 2.2, was released back in 2005 and is available as a 1.4MB MSI from the Microsoft Download Centre. Operating system compatibility is stated as Windows 2000, Windows XP Professional Edition & Windows Server 2003, but I run it on Windows 7, which suggests to me that it’ll probably run on Windows Vista and maybe even Windows 8.

In case you missed the really important point above that makes Microsoft Log Parser a great little utility: it allows you to run SQL-like statements against your log files. A simple and familiar exercise might be to find broken links within your own website or to find 404 errors from broken inbound links.
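As a hypothetical sketch of that 404 exercise (it assumes the same W3C log format and install path used in the queries later in this article, where the HTTP status column is named sc-status – check your own log headings first), such a query might look like:

```shell
"c:\program files\log parser 2.2\LogParser.exe" -e:5 -i:W3C "SELECT cs-uri-stem, COUNT(*) AS Hits INTO 404s.csv FROM ex* WHERE sc-status = 404 GROUP BY cs-uri-stem ORDER BY Hits DESC"
```

The output file 404s.csv would list each URL that returned a 404 alongside how often it was requested, most requested first.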

Gaining Access To Your Log Files

Depending on the type of website you’re running and what environment you run it in, getting access to your log files can be the single biggest hurdle in this endeavour, but you just need to be patient and persevere.

If you have your own web hosting, it is likely that you’ll have access to your server log files via your web hosting control panel software such as cPanel or Plesk. That doesn’t necessarily mean that your hosting has been configured to actually log website access, as a lot of people turn logging off to save a little disk space.

If your hosting doesn’t currently have logging enabled, the first port of call is configuring it, as it is obviously a prerequisite to visualising Googlebot crawl activity through your website. Once configured, depending on the size of your website and how important it is in Google’s eyes, you may need to wait 4-6 weeks to gather sufficient data to understand how Googlebot is accessing your website.

Corporate websites will invariably have web traffic logging enabled as it is helpful for debugging and compliance reasons. Getting access to the log files might require an email or two to your IT department or maybe a phone call to a senior system administrator. You’ll need to explain to them why you want access to the log files, as it will normally take some time for them to either organise security access for you to access that part of your corporate network or they may need to download/transfer them from your external web hosting to a convenient place for you to access them from.

Organising Your Log Files

To get the most out of this technique, you’ll want access to as many weeks or months of log files as possible. Once you download them from your web hosting provider or your IT department provides access to the log files, place them all in the same directory for log analysis by Microsoft Log Parser.

Directory Showing Daily Web Server Logs Broken Into 100MB Incremental Files

As you can see in the image above, the server I was working with generates log files with a consistent naming convention per day and produces a new incremental file for every 100MB of access logs. Your web server will probably generate a different sequence of daily, weekly or monthly log files, but you should be able to put all of them into a directory without any hassle.

Microsoft Log Parser Primer

Log Parser by Microsoft is a command line utility which accepts arguments from the command prompt to instruct it how to perform the log analysis. In the examples below, I’ve passed in three arguments to Log Parser, -e, -i and the query itself, but you can provide as many as you need to get the desired output.

Within the query itself, the columns you SELECT are the column headings out of the log file; in my example below I have a column heading named cs-uri-stem representing the URL without the domain information. Open one of your log files in a text editor and review the headings in the first row of the log file to find out what the column headings are to use within your SELECT statement.

Just like a SQL query in a relational database, you need to specify what to select FROM, which under normal circumstances is a SQL database table. Log Parser maintains that same idiom, except you select from an individual log file, where you’d provide the file name, or from a group of log files identified by a pattern. In the examples below, you can see that the FROM statement has ex*, which matches the file naming pattern in the Organising Your Log Files section above.

As you’d expect, Log Parser provides a way to restrict the set of log records to analyse with a WHERE statement and it works exactly the same way it does in a traditional SQL database. You can join multiple statements together with brackets to provide precedence along with AND or OR statements.

Conveniently, Microsoft Log Parser also provides aggregate functions like COUNT, MAX, MIN, AVG and many more. This in turn suggests that Log Parser also supports other related aggregate functionality like GROUP BY and HAVING, which it does, in addition to ORDER BY and a raft of other more complex functionality.
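As a hypothetical illustration of those aggregate features (the threshold and query are made up, and it assumes the same W3C log format, file naming and install path as the queries later in this article), you could count Googlebot requests per URL and keep only the heavily crawled ones, with HAVING filtering on the aggregate:

```shell
"c:\program files\log parser 2.2\LogParser.exe" -e:5 -i:W3C "SELECT cs-uri-stem, COUNT(*) AS Hits FROM ex* WHERE cs(User-Agent) LIKE '%googlebot%' GROUP BY cs-uri-stem HAVING COUNT(*) > 100 ORDER BY Hits DESC"
```

Just as in a traditional SQL database, WHERE filters individual log rows before grouping while HAVING filters the grouped results afterwards.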

Importantly for larger log analysis, Log Parser also supports storing the output of the analysis somewhere, which can be achieved by using the INTO keyword after the SELECT statement as you can see in the examples below. If you use the INTO keyword, whatever the SELECT statement outputs will be stored in the file specified, whether it is a single value or a multi-column, multi-row table of data.

Microsoft provide a Windows help document with Log Parser, which is located in the installation directory and provides a lot of help about the various options and how to combine them to get the output that you need.

Now that the super brief Log Parser primer is over and done with, time to charge forward.

Identifying Googlebot Crawl Activity

While Microsoft Log Parser is an incredible utility, it has a limitation that a normal SQL database doesn’t – it does not support joining two or more tables or queries together on a common value. That means to get the data we need to perform a Googlebot crawl analysis, we’ll need to perform two queries and merge them in Microsoft Excel using a simple VLOOKUP.

Some background context so the Log Parser queries below make sense: the website the log files are from uses a human-friendly URL structure, with descriptive words arranged in a directory-like structure ending with a forward slash. While it doesn’t happen a lot on this site, I’m lower-casing the URLs to consolidate crawl activity into fewer URLs and get a better sense of Googlebot’s activity as it crawls through the site. Similarly, I’m deliberately ignoring query string arguments for this particular piece of analysis, again to consolidate crawl activity into fewer URLs. If there is a lot of crawl activity around a group of simplified URLs, it’ll show up in the visualisation and be easier to query for the specifics later.

Next up, the queries themselves. Open a Command Prompt by going START->RUN and entering cmd, then change directory to where you’ve stored all of your log files. Microsoft Log Parser is installed in the default location on my machine; change the path accordingly if needed.

Query: Find all URLs

"c:\program files\log parser 2.2\LogParser.exe" -e:5 -i:W3C "SELECT TO_LOWERCASE(cs-uri-stem), date, COUNT(*) INTO URLs.csv FROM ex* WHERE cs-uri-stem LIKE '%/' GROUP BY TO_LOWERCASE(cs-uri-stem), date ORDER BY TO_LOWERCASE(cs-uri-stem)"

Query: Find all URLs that Googlebot accessed

"c:\program files\log parser 2.2\LogParser.exe" -e:5 -i:W3C "SELECT TO_LOWERCASE(cs-uri-stem), date, COUNT(*) INTO googlebot.csv FROM ex* WHERE cs-uri-stem LIKE '%/' AND cs(User-Agent) LIKE '%googlebot%' GROUP BY TO_LOWERCASE(cs-uri-stem), date ORDER BY TO_LOWERCASE(cs-uri-stem)"

You can be as specific with the user agent string as you like; I’ve been very broad above. If you felt it necessary, you could filter out fake Googlebot traffic by performing a reverse DNS lookup on each IP address to verify it is a legitimate Googlebot crawler, per the recommendation from Google.
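Google’s recommended check is a reverse DNS lookup on the requesting IP, a check that the resulting hostname belongs to Google, then a forward lookup to confirm it resolves back to the same IP. A minimal Python sketch of that check:

```python
# Sketch: verifying a claimed Googlebot IP via reverse DNS plus a
# forward-confirming lookup, per Google's published recommendation.
import socket

def is_real_googlebot(ip: str) -> bool:
    try:
        host, _, _ = socket.gethostbyaddr(ip)  # reverse DNS lookup
    except socket.herror:
        return False
    # Genuine Googlebot hosts live under googlebot.com or google.com
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        forward = socket.gethostbyname(host)   # forward-confirm
    except socket.gaierror:
        return False
    return forward == ip
```

Filtering the raw log lines through a check like this before running the Log Parser queries would remove scrapers that merely spoof the Googlebot user agent string.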

Microsoft Excel

Open both the CSV files output from the queries above. Add a new Excel Worksheet named “Googlebot” to the URLs.csv file and paste into it the contents of googlebot.csv. This will allow you to merge the two queries easily into a single sheet of data that you can generate the visualisation from.

VLOOKUP

Since the queries above result in more than one line per URL for each day they were accessed, a new column needs to be added to work as a primary key for the VLOOKUP. Insert a new column at column A and title it “PK” in both worksheets. In cell A2 in both worksheets, add the following function and copy it down for all rows in both worksheets:

=CONCATENATE(B2, C2)

The CONCATENATE function will join two strings together in Excel. In our instance we want to join together the URL and the date it was accessed, so that the VLOOKUP function can access the correct Googlebot daily crawl value.

Sort both Excel worksheets by the newly created PK column, A->Z. Make sure this step is carried out, as a VLOOKUP function doesn’t work as you expect if the table of data you’re looking values up from isn’t sorted.

Add a new column named Googlebot to your URLs worksheet and in the first cell we’re going to add a VLOOKUP function to fetch the number of times a given URL was crawled by Googlebot on a given date from the Googlebot worksheet:

=IFERROR(VLOOKUP(A2,Googlebot!$A$2:$D$8281, 4), 0)

The outer IFERROR says that if the VLOOKUP function returns an error, return 0 instead. This is helpful since not all URLs within the URLs worksheet have been accessed by Googlebot. The inner VLOOKUP function looks up the value of A2, the URL and date key you added earlier, in the first column of the Googlebot worksheet’s data range (excluding the column headings) and returns the fourth column. If you’re not familiar with the $ characters in between the Excel cell references, they keep the range static when the function is copied down the worksheet.

Visualising Googlebot Crawl Data Ready

The image above shows, left to right the URL with numeric date appended, actual URL, date the URL was crawled, number of times Googlebot crawled the URL and the total number of times the URL was accessed.

PivotTable

Microsoft Excel provides a piece of functionality named PivotTable, which essentially allows you to rotate or pivot your spreadsheet of information around a different point and perform actions on the pivoted information such as aggregate functions like sum, max, min or average.

In our example, we don’t need to perform calculations on the data – that was handled by Log Parser. Instead, the pivot table is going to take the date column from the URLs worksheet, which holds a value for each day within your log files, and transform it so that each unique date becomes a new column. For example, if you were analysing 30 days of crawl information, you’d go from one column containing all 30 dates to 30 columns, one per date.
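In code terms, the rotation the PivotTable performs can be sketched like this (purely illustrative, with made-up rows):

```python
# Sketch of the PivotTable step: rotate (url, date, hits) rows so each
# distinct date becomes a column, summing hits per cell.
from collections import defaultdict

def pivot(rows):
    """rows: iterable of (url, date, googlebot_hits).
    Returns (sorted dates, {url: [hits per date]})."""
    dates = sorted({date for _, date, _ in rows})
    index = {d: i for i, d in enumerate(dates)}
    table = defaultdict(lambda: [0] * len(dates))
    for url, date, hits in rows:
        table[url][index[date]] += hits  # SUM, as in Value Field Settings
    return dates, dict(table)

rows = [("/blog/", "2013-01-01", 7), ("/blog/", "2013-01-02", 3),
        ("/about/", "2013-01-02", 1)]
print(pivot(rows))
# -> (['2013-01-01', '2013-01-02'], {'/blog/': [7, 3], '/about/': [0, 1]})
```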

Within the URLs worksheet, select the columns representing the URL, the date and the number of times Googlebot crawled the URL. Next click Insert in the Excel navigation and select PivotTable (the left-most icon within the ribbon navigation in recent versions of Microsoft Office). Once selected, Excel will automatically select all rows and columns that you highlighted in the worksheet and pressing OK will create a new worksheet with the pivot table in it, ready for action.

Within the PivotTable Field List in the right column, place a check against each of the three columns of information imported into the pivot table. In the bottom of the right column, drag the fields around so that the date field is in the Column Labels section, the URL is within the Row Labels section and the Googlebot crawl count is within the Values section. Excel will initially default to the COUNT aggregate function; update it to SUM by clicking the small down arrow to the right of the item, selecting Value Field Settings and picking SUM from the list.

Visualising Googlebot Crawl Excel Pivot Table Options

Visualisation

Now that the data has been prepared using the PivotTable functionality within Excel, we’re able to apply some sort of visual cue to the data to make it easier to understand what is happening. To solve that problem quickly and easily, we’re going to use Conditional Formatting, which allows you to apply different visual cues to data based on the data itself.

Select the rows and columns that represent the daily crawl activity; don’t include the headings or the total column, or the large numbers in those will skew the visualisation. Once selected, click the Home primary navigation item and then Conditional Formatting, expand out Colour Scales and choose one you like. I chose the second item in the first row, so URLs with lots of crawl activity will appear red, or hot.

Tip
To increase the density of the visualisation if you’ve chosen to visualise a large date range, select the columns that represent the dates, right click and choose Format Cells, go to the Alignment tab and set the text direction to 90 degrees, or vertical.

Use the zoom functionality in the bottom right corner of Excel to zoom out if necessary, and what you’re left with is a heat map showing Googlebot crawl activity across the different URLs within your website over time.

Visualising Googlebot Crawl Activity With Excel & Conditional Formatting

Without a mechanism to visualise the crawl rate of Googlebot, it would be nearly impossible to notice that the three URLs in the middle of the image were repeatedly crawled by Googlebot, let alone ask why. Could this have been a surge in links off the back of a press release? Maybe there was press coverage that didn’t link, which would represent a fast, easily identifiable link building opportunity.

It is now dead simple to see which sections of your website aren’t getting crawled very often, which sections are getting crawled an appropriate amount and which sections could be burning up Googlebot crawl resources needlessly, resources that could be spent crawling useful content in other sections of the website.

Go forth and plunder your web server logs!

Remarketing With Gmail Search Field Trial

The following article describes how to take advantage of changes in search behaviour across various Google products to provide free remarketing to potential customers who have shown an interest in your product or service.

Google+ has been an incredible source of inspiration for me since it came about. The number of really intelligent conversations I’ve read or been part of has been amazing. What I’m about to describe is the result of such a discussion started by Dan Manahan when he asked if anyone had tried to leverage Google Drive now that it was part of the personalised search experience.

Before jumping into how it all works, a quick bit of background about Google personalised search.

Search Plus Your World

Google began personalising the search results in 2007 when it started leveraging a user’s search history. In 2009 Google announced a product called Social Search, which used signed-in users’ social connections across various social networks to help surface higher quality, more relevant information from within your greater network of friends online.

Fast forward to January 2012 and Google announced Search Plus Your World (SPYW) as the next major evolution in personalising the search results. The idea behind SPYW is simple: Google want to surface as much contextually relevant information about a user’s query from as many different sources as possible.

Search Plus Your World currently supports three types of personalisation:

  1. Personal Results
  2. Profiles in Search
  3. People & Pages

As an example of what Google Search Plus Your World can do: if a person has uploaded photos of their pet to Google+, a search in Google for the pet’s name will return an array of photos which includes those personal pet photos alongside more generic images that Google thinks are relevant for that query. It could also include any relevant Google+ posts from that person or their network. Google+ profiles will be shown directly in the search results, allowing a user to follow them quickly, and generic queries such as [music] will yield suggestions for people or pages to follow around that topic.

Gmail Search Field Trial

During August 2012 Google announced and opened up, with little fanfare, a limited beta feature named the Gmail Search Field Trial. The goal of the Gmail Search Field Trial is to remove the need for users to remember where to search for something.

Currently a heavy user of Google products needs to search within each product separately for resources of interest: searching in Gmail for an order confirmation from Amazon, checking what the weekly sales were in a Google Spreadsheet and so on.

To address this issue, users who signed up for the Gmail Search Field Trial see information from various Google products surfaced in several different search boxes throughout Google’s vast product offering. For example, searching for [amazon] would show a user their Amazon emails in the right hand sidebar of the Google search results, [my flights] would show upcoming flights in great detail, [my events] would access Google Calendar and more. Searching within Gmail would yield results from Google Drive such as documents, spreadsheets and so on. In the image below you can see two Google Spreadsheets showing up in the Gmail search box after searching for [30 day].

Gmail Search Field Trial: 30 Day Search

A search in Gmail for [30 day] showing two relevant Google Spreadsheets in the search results

Remarketing

Remarketing allows an advertiser to show ads to users after they’ve had some amount of contact with the advertiser. Consider a user doing research for a holiday many weeks or months in advance of actually booking the holiday itself. After the user has been to Holiday Website A but not purchased the holiday, Holiday Website A could use remarketing to show that user ads as they browse the internet.

There are many different forms of remarketing available on the market but they all fundamentally rely on third party browser cookies as a mechanism to identify an individual user after they’ve left the advertisers website and are browsing elsewhere on the internet.

Google provide remarketing options through Google AdWords; there are companies built solely around remarketing, such as AdRoll; and more recently Facebook entered the fray with a product of their own.

It is worth noting that while the internet has historically made a big deal of Google updating their terms of service, Search Plus Your World and the Gmail Search Field Trial are exactly the kinds of services Google can provide their customers by consolidating their many terms of service and allowing customer data to flow seamlessly between different Google products.

How To Use Google Remarketing Without Paying For It

Now that you know that information stored within various Google products can show up for relevant queries by a user across Google products, how do you leverage that to help your business? The answer: get your business’s content into your potential customers’ Google Accounts.

A practical example might be in order to really help crystallise the idea.

Scenario

Imagine an everyday bank that provides home loans. Home loans are a big deal for most people and not the kind of decision people leap into; they take a lot of time to think through, are well researched and the purchase cycle can run several months.

To help keep the bank’s products and services in users’ minds, the bank might provide various PDF documents for users to download, such as comparisons of the various home loan products that it provides.

Traditionally a website would provide a call to action to download a PDF. A savvy marketer, however, might include an additional call to action of “Save to Google Drive” and thus provide an easy and natural way of getting the content into a user’s Google Account.

As soon as the user saves the home loan comparison PDF to their Google Drive, subsequent searches related to the content within the PDF will trigger the PDF to show up in various places throughout Google products as outlined above.

Reducing Friction

Google have done a lot of the hard work for marketers by providing useful entry points to Google Drive. For example, if you wanted to link to a research paper by Google describing their distributed storage system known as Bigtable - you can link to Google Drive with a URL such as:

http://docs.google.com/viewer?url=http%3A%2F%2Fresearch.google.com%2Farchive%2Fbigtable-osdi06.pdf

When the user clicks the link, the document will be opened in Google Drive, providing the user with a one-click option to save the document to their own Google Drive storage. Of course this isn’t limited to PDF documents; Google Drive supports many different file types that would be useful to the bank in distributing its content, such as a home loan repayment calculator in Microsoft Excel.
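For reference, a viewer link like the one above is just the document’s URL percent-encoded onto the viewer endpoint. A quick Python sketch, using the Bigtable paper from the example:

```python
# Sketch: building a Google Docs viewer link by percent-encoding the
# target document's URL into the viewer's url parameter.
from urllib.parse import quote

def drive_viewer_url(pdf_url: str) -> str:
    # safe="" forces encoding of "/" and ":" as well
    return "http://docs.google.com/viewer?url=" + quote(pdf_url, safe="")

print(drive_viewer_url("http://research.google.com/archive/bigtable-osdi06.pdf"))
# -> http://docs.google.com/viewer?url=http%3A%2F%2Fresearch.google.com%2Farchive%2Fbigtable-osdi06.pdf
```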

Measuring For Success

Google Drive uses HTTPS, Secure Sockets Layer (SSL) encryption of the connection between the user and Google, just like internet banking or ecommerce stores do. This means that if a PDF document contained a plain hyperlink back to the bank website in the example above, no referrer data would be passed and the bank website wouldn’t know the click originated from someone who stored the document in Google Drive.

To get around that problem, include links within the PDF documents that contain campaign tracking from your favourite web analytics package such as Google Analytics.

Since we’re studious marketers and want to measure the benefits of our efforts, I’d recommend using two near identical copies of a given PDF:

  1. Standard PDF with standard campaign tracking
  2. Standard PDF with Google Drive campaign tracking

The standard call to action within the website links to file number one above, while the “Save to Google Drive” option links to file number two.

To address duplicate content issues with the two nearly identical versions of each PDF document, ideally a rel="canonical" HTTP Link response header would be served with the Google Drive version of each file, pointing at the standard version (the X-Robots-Tag header is the place for directives like noindex). This should keep the Google Drive version of the PDF documents from showing up in Google searches in their own right and consolidate any link equity the Google Drive PDF documents might accrue into the standard PDF document.
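For non-HTML files such as PDFs, Google reads the canonical hint from an HTTP Link response header rather than from markup. A sketch of how that might be configured in Apache, assuming mod_headers is enabled and using hypothetical file names and domain:

```apache
# Serve the Drive-destined copy of the PDF with a canonical header
# pointing at the standard copy (file names are illustrative)
<FilesMatch "home-loan-comparison-drive\.pdf$">
  Header set Link '<http://www.bank-example.com/guides/home-loan-comparison.pdf>; rel="canonical"'
</FilesMatch>
```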

Understanding Purchase Cycles

Different products purchased online have different length sales cycles. As consumers move through the different steps in the purchase cycle, their behaviour will change as they become more familiar with the product they are researching or looking to buy.

Throughout this journey, consumer search behaviour also changes; at the start of the process users might search for very broad terms like [buy new house], that might morph into [buy new house with swimming pool in Melbourne]. Once the user knows what type of house they want and the approximate price, they’ll start searching for [mortgage repayment calculators] which will lead them into researching mortgages and home loans at large, first broadly but later with very specific requirements.

Businesses that have their online content development in tune with this varied and changing research behaviour can also use remarketing through the Gmail Search Field Trial to understand where their content fits into the equation, by including a date in each link. Google Analytics campaign tracking might be configured with the following options:

  • Source: google
  • Campaign: Google Drive Home Loan Comparison 20130315
  • Medium: retargeting
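Assembled into a URL, the options above might look like this (a Python sketch; the landing page domain and path are hypothetical):

```python
# Sketch: building the campaign-tagged link to embed in the Drive copy
# of the PDF. Parameter values mirror the bullet list above.
from urllib.parse import urlencode

params = {
    "utm_source": "google",
    "utm_medium": "retargeting",
    "utm_campaign": "Google Drive Home Loan Comparison 20130315",
}
link = "http://www.bank-example.com/home-loans/?" + urlencode(params)
print(link)
# -> http://www.bank-example.com/home-loans/?utm_source=google&utm_medium=retargeting&utm_campaign=Google+Drive+Home+Loan+Comparison+20130315
```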

When reviewing campaign traffic to their site through Google Analytics, analysts will see dates from the past within the reporting window. Some simple maths and suddenly the bank will know that a certain collection of PDF documents is useful to potential clients eight weeks before they make an online enquiry, but is not helping to close the deal within the final two weeks, for instance.

This obviously requires more work than simply adding set-and-forget campaign tracking, but for certain verticals it may well be worth the effort. The bank in the example might edit the PDFs once a fortnight, keeping the same file names, to update the campaign tracking dates.

Content Development

Since Gmail Search Field Trial provides an avenue to keep your product front and centre when users are actively looking for your services, it makes sense that as a marketer you capitalise on that exposure wherever possible.

This presents an interesting exercise in leveraging data from existing sources, such as paid search campaigns or web analytics, to understand user intent and the lead time ahead of micro and macro level goals within the website.

Building out custom multi-channel funnels within Google Analytics for different organic and paid search campaigns will help identify where in the purchase cycle a business should try getting their content into their potential customers Google Accounts for continued exposure.

Remember, just as standard Google search has best practices, maximising exposure through the Gmail Search Field Trial has its own rules. For example, the title of a Google Spreadsheet is critical to it showing up, so businesses should test adding different documents into Google Drive with different settings to understand how Google indexes them for search.

Conclusion

If your business already has a collection of non-HTML content for users to consume, check whether the current format of those documents is supported by Google Drive. If it is, follow the steps above and see if your business can benefit from passive remarketing through the Gmail Search Field Trial. If you don’t have non-HTML content and your business has a medium to long sales cycle, now would be a great time to start considering what resources you could develop that would slot nicely into Google Drive and help your potential customers, which could lead to additional sales and exposure in the future.

Google Accounts Have Doubled In Under 12 Months

Why would anyone care about how many Google Accounts exist in the world? It turns out that the number of active Google Accounts has knock on effects to website owners and internet marketing at large right around the world.

Many don’t realise it, but when you perform a search for [nike running shoes] on Google or most other search engines and click through to a website, the website you visit knows that you searched for [nike running shoes]. This isn’t anything malicious, deceptive or devious that Google or most other search engines are doing; it is part of how the internet has worked since the dawn of time.

Website owners can use the search phrase that a user typed to find their site to better understand their customers’ needs, improve their website content, tailor the user experience based on the phrase the user typed in, understand how different advertising channels interact with one another and much, much more.

Unfortunately, there is a prerequisite for the user’s query to pass across to the website in question: the user needs to have done their search on http://www.google.com and not https://www.google.com. Notice the addition of the letter s in https in the second Google URL; it signifies that Google is using Secure Sockets Layer (SSL) to encrypt the connection between the user and Google, just like internet banking websites do.

Google started providing encrypted search back in 2010 and, while the connection between the user and Google was encrypted, through technical jiggery-pokery Google were still passing the user’s search query through to websites. In October 2011, Google made a change whereby users logged into their Google Account on google.com would be automatically switched over to HTTPS. In March 2012, Google announced that they were rolling that same change out globally through all of their regional Google portals, such as www.google.com.au. Importantly, unlike Google’s encrypted search product, which was driven by user security, the most recent changes are about user privacy; as such, Google isn’t using technical jiggery-pokery to pass the user’s search query through to websites and instead websites get no keyword data at all.

Enter the unassuming Google Account, a pathway into a vast array of Google products. Once a user logs into their Google Account, they will remain signed in until they specifically log themselves out. As such, as more and more Google Accounts are created, more and more users use Google secure search by default, and website owners and internet marketers receive less and less search query data to help improve their websites and user experience.

Microsoft Internet Explorer – Measurement Tool Of Choice

Measuring how fast Google Accounts are growing without being on the inside of Google is a little tricky. No one network provider or ISP can see all traffic on the internet, which means that any time someone outside of Google makes a claim about the size of Google or any other business on the internet, it is a guess; educated maybe, but a guess nonetheless. Compounding that, if a user has a Google Account there is a good chance they’ll be logged into it, which means their internet traffic is encrypted between them and Google, making it harder again for network providers or third parties to provide an accurate figure.

Under normal circumstances I’d be one of the first to knock Microsoft Internet Explorer as a browser: it is slow to load, slow to render, not that great on memory consumption, prone to crashing, most versions aren’t web standards compliant and it doesn’t provide an ecosystem of plugins in the way that more modern browsers like Firefox or Chrome do. However, I’m actually happy that Internet Explorer is a slow moving beast in this instance, as it makes measuring how fast Google Accounts are growing possible.

When Google announced that users signed into their Google Account would use HTTPS by default, it signaled a changing of the tides; Google was going to begin moving more and more Google products over to HTTPS. It was now but a matter of time before other businesses and vendors began using Google secure search around the world.

Internet marketers’ worst fears were realised when Mozilla announced that Firefox 14 had moved the search box over to Google secure search in July 2012. Apple then released iOS 6 in September 2012, which defaulted the Safari browser to Google secure search as well. Most recently, Google announced that when Chrome version 25 is released shortly, searches in the omnibox will move over to Google secure search too.

Importantly, each of the browsers above uses Google secure search whether the user is logged into their Google Account or not. As such, it isn’t possible to discern through web analytics packages whether a Firefox user is logged into their Google Account or has simply performed a search in the latest version of Firefox.

Fortunately, Microsoft haven’t enabled Google secure search by default yet, which makes Internet Explorer a good measurement tool. That isn’t to say that some users of Internet Explorer don’t manually use the HTTPS version of Google search, which would skew the numbers a little, but I don’t think it’d be significant enough to break the trend.

Astute readers may be wondering why not produce a complex analysis of different browsers and browser versions, trying to capture the largest percentage of traffic to perform the analysis on. The answer is straightforward: the growth rate of Google Accounts in browsers other than Internet Explorer is even higher, which I put down to those users being more internet savvy; they aren’t using Internet Explorer and probably have a higher likelihood of holding a Google Account as a by-product of that. As such, as Firefox and Chrome continue to eat market share from Internet Explorer, it is reasonable to assume that the growth rate will accelerate even faster.

Measuring Google Account Growth With Google Analytics

What I’ve proposed below won’t tell you how many Google Accounts exist as an absolute number, but it does provide a guide as to how fast Google Accounts might be growing in the wild.

To do this you’ll want to create two Google Analytics Advanced Segments:

    1. Source Google, Medium Organic, Browser Internet Explorer
      Google Analytics Advanced Segment: Source Google, Medium Organic, Browser Internet Explorer
    2. Source Google, Medium Organic, Browser Internet Explorer, Keyword (not provided)
      Google Analytics Advanced Segment: Source Google, Medium Organic, Browser Internet Explorer, Keyword (not provided)

Enable both of the newly created Google Analytics Advanced Segments, pick a date range from January 2012 until present and use weekly data points to smooth out the graph a little.

Google Analytics Google Account Growth Advanced Segments

Use the Google Analytics Export feature and export the current view into a format you’re happy to work with in your spreadsheet application, I chose CSV but use whatever works for you. The way the data is exported from Google Analytics isn’t quite what we need, with a row for each weekly data point per advanced segment:

Google Account Growth Google Analytics Raw Export

In the screenshot below you can see I’ve made three changes to help with data formatting:

  1. used a function to copy the values from the “Google, Organic, IE” rows into a new column
  2. added a new column showing the percentage of Google, Organic, Internet Explorer, (not provided) against Google, Organic, Internet Explorer
  3. filtered the Segment column and turned off Google, Organic, IE since the values are now in a new column thanks to point 1 above
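The percentage in step 2 is simple enough to sanity-check in a few lines of Python (the figures are illustrative, matching the levels discussed below):

```python
# Sketch: the (not provided) share computed per weekly data point, the
# same calculation as the percentage column added in Excel.
def not_provided_share(not_provided: int, total: int) -> float:
    """Percentage of (not provided) visits against all Google organic
    Internet Explorer visits, rounded to one decimal place."""
    return round(100.0 * not_provided / total, 1)

print(not_provided_share(70, 1000))   # -> 7.0, roughly the March 2012 level
print(not_provided_share(140, 1000))  # -> 14.0, the level seen entering 2013
```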

Google Account Growth Excel

Highlight the two Google Analytics Advanced Segment columns and the percentage column, chart them in Excel and you’ll quickly see the growth rate of Google Accounts for the audience relevant to that particular website.

In the graph below, the purple line represents the percentage of the (not provided) keyword against total organic visits. You can see that in March 2012, when Google rolled out secure search globally, the percentage was at around 7% and that toward the end of 2012 and into 2013 it climbed over 14%.

Google Account Growth 2012

Does the above graph mean that the number of Google Accounts has actually doubled? Not directly, no. It could also mean that the total number of Google Accounts remained unchanged and twice as many people started using their existing Google Accounts. Of course, both of those extremes are equally unlikely; more likely, the steady growth of (not provided) above represents a combination of Google Account growth and, potentially, interface changes and the push of Google+ driving higher levels of Google Account usage.

If the trend above holds over 2013, website owners and internet marketers are set to lose more keyword data from their web analytics packages through the natural growth of Google Accounts alone. Of course, the rate at which growth in Google Accounts impinges on internet marketing efforts will be insignificant compared to Google Chrome switching over to Google secure search in its next update and the inevitability of Internet Explorer receiving an upgrade in a future service pack that uses Google secure search by default.

Do you have a no keyword internet marketing analysis strategy in place yet?

Does Fast Web Hosting Improve Search Engine Rankings?

Everyone loves fast websites; it doesn’t matter if you’re a casual internet user or a power user, website performance matters. Research by Google, Microsoft, Yahoo! and many others has shown that fast websites have better user metrics, whether that be converting users into customers, performing more searches, higher usage patterns, increased order values and more. In addition to the benefits across the board, Google announced in April 2010 that website performance was now a ranking factor, such that slow loading websites could be ranked lower in the search results for a given query, or conversely that fast websites could be ranked higher.

A Brief History

I live in Queensland, Australia; specifically on the sunny Gold Coast.

Since I started my personal blog back in 2004, I’ve always wanted my hosting to be fast for me personally. My desire for speed wasn’t born out of some crackpot web performance mantra; it was far simpler than that: I hate slow email. Having my web hosting in Australia guaranteed that my personal email was going to be lightning fast and, as a happy by-product, my website screamed along as well.

In November 2012 I decided that I’d attempt to simplify my life, go hunting for a new web host that provided some additional features I was looking for and consolidate some of my hosting accounts. I still needed the bread and butter PHP and MySQL for WordPress but wanted some more options for smaller projects I planned to dabble with this year, so I added the following into the mix:

  • PostgreSQL
  • Python
  • Django
  • Ruby
  • Rails
  • SSH access
  • Large storage limits
  • No limits on the number of websites I could host
  • Reasonably priced

I begrudgingly accepted the fact that I’d probably end up using a United States web host and had to build a bridge over my soon to be 200ms+ ping. After a moderate amount of research, I signed up with Webfaction who tick all of the above boxes.

Crawling

Google is a global business with local websites such as www.google.com.au or www.google.de in most countries around the world. They service their massive online footprint by having dozens of data centers spread around the world. Users from different parts of the world tend to access Google services via their nearest data center to improve performance.

What a lot of people don’t realise is that despite the fact that Google have data centers around the world, Google’s web crawler, named Googlebot, crawls the internet only from the data centers located in the United States of America.

For websites that are hosted in the far reaches of the world, this means it takes more time to crawl a web page compared to a web page hosted in the US due to the vast distances, network switching delays and the obvious limit of the speed of light through fibre optic cables. To illustrate that a little more clearly, the below table shows ping times from around the world to my Webfaction hosting located in Dallas, TX:

Dallas, U.S.A. 3ms
Florida, U.S.A. 32ms
Montreal, Canada 53ms
Chicago, U.S.A. 58ms
Amsterdam, Netherlands 114ms
London, United Kingdom 117ms
Groningen, Netherlands 133ms
Belgrade, Serbia 150ms
Cairo, Egypt 165ms
Sydney, Australia 195ms
Athens, Greece 199ms
Bangkok, Thailand 257ms
Hangzhou, China 267ms
New Delhi, India 318ms
Hong Kong, China 361ms

Based on the table above, it is clear that moving to US web hosting was going to significantly reduce my load times as far as Googlebot was concerned.
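The physics behind those numbers is easy to sketch. Light in optical fibre travels at roughly two-thirds of its vacuum speed, about 200,000 km/s, which puts a hard floor under any round trip regardless of how good the network path is. The great-circle distance below is an approximate figure used purely for illustration:

```python
# Theoretical minimum round-trip time imposed by the speed of light in
# optical fibre (roughly 200,000 km/s, about two-thirds of c). Real pings
# run higher because of routing detours and switching delays.
FIBRE_SPEED_KM_S = 200_000

def min_rtt_ms(distance_km: float) -> float:
    """Best-case round trip in milliseconds over a direct fibre path."""
    return 2 * distance_km / FIBRE_SPEED_KM_S * 1000

# Sydney to Dallas is roughly 13,800 km by great circle (approximate figure)
print(round(min_rtt_ms(13_800)))  # ~138ms floor versus the 195ms observed
```

Even a perfect network path can’t beat that floor, which is why Sydney’s 195ms ping is within shouting distance of the physical limit.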

While I was still using my Australian web hosting, Google Webmaster Tools reported an average crawl time of around 900-1000ms, which you can see reflected on the left of the graph below; after the move to Webfaction it is now averaging around 300ms.

Google Webmaster Tools Crawl Stats Time Spent Downloading - www.lattimore.id.au

The next two graphs, also taken from the Crawl Stats section of Google Webmaster Tools, show the number of pages crawled per day and the number of kilobytes downloaded per day respectively.

Google Webmaster Tools Crawl Stats Pages Crawled - www.lattimore.id.au

Google Webmaster Tools Crawl Stats Kilobytes Downloaded - www.lattimore.id.au

Google crawl the internet in descending PageRank order, which means the most important websites are crawled the most regularly and most comprehensively, while the least important are crawled less frequently and less comprehensively. Google prioritise their web crawl efforts because, although they have enormous amounts of computing power, it is still a finite resource that they need to consume judiciously.

By assigning a crawl budget to a website based on the perceived importance of each website, it provides Google a mechanism to spend an appropriate amount of resource crawling different sites. Most importantly for Google, it limits the amount of time they will spend crawling a less important website with very large volumes of content, which might otherwise consume more than its fair share of resources. In simple terms, Google is willing to spend a fixed amount of resource crawling a website, such that they could crawl 100 slow pages or 200 pages that loaded in half the time.
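That trade-off can be expressed as a toy model – a fixed time budget divided by the average page load time. The numbers are purely illustrative; Google’s real budgeting is of course far more sophisticated:

```python
# Toy model of a fixed crawl budget: for the same time spent, halving the
# average page load time roughly doubles the pages crawled. Illustrative
# numbers only -- not Google's actual budgeting mechanism.
def pages_crawled(budget_seconds: float, avg_load_seconds: float) -> int:
    """How many whole pages fit into a fixed crawl-time budget."""
    return int(budget_seconds // avg_load_seconds)

budget = 100.0  # hypothetical seconds Google will spend on a site
print(pages_crawled(budget, 1.0))   # 100 slow pages
print(pages_crawled(budget, 0.5))   # 200 pages loading in half the time
```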

It’s been reported several times before that improving the load time of a website will have an impact on the number of pages that Google will crawl per day. Interestingly, that isn’t reflected in the graphs above at all. I’d speculate that since my personal blog has a PageRank 4 home page, it receives ample crawl budget for the approximately 625 blog posts I’ve published over the years. While reducing the load time of a site would normally help, it didn’t have an impact on my personal blog because it isn’t suffering from crawling and indexing problems.

Indexing

Google Webmaster Tools Advanced Index Status - www.lattimore.id.au

Google announced in July 2012 a new report in Google Webmaster Tools named Index Status. The default view in the report shows the number of indexed URLs over time, however the Advanced tab shows additional detail that can highlight serious issues with a website’s health at a glance, such as the number of URLs blocked by robots.txt.

One particular attribute of the Advanced report is a time series of URLs named Not Selected: URLs that Google knows about on a given website but that aren’t being returned in the search results because they are substantially similar to other URLs (duplicate content), have been redirected, or have been canonicalised to another URL.

The graph above shows a huge reduction in the number of Not Selected URLs on 11 November 2012, which was the week that I changed my hosting to Webfaction. It should be noted that I didn’t change anything when I moved my hosting; it was a simple backup-and-restore operation. That week Google Webmaster Tools reported nearly 1,000 URLs were removed from Not Selected.

I’m a little confused as to what to make of that, since the Total Indexed line in blue above didn’t increase at all. Is it possible that Google had indexed URLs as part of Total Indexed which weren’t being returned for other reasons, and that improving the performance of my website somehow meant they are no longer counted in Not Selected?

While I’m grateful that Google Webmaster Tools now provides the Index Status report as it does provide some insight into how Google views a website, it could contain far more actionable data. For example, to help with debugging it would be great if you could download a list of URLs that Google is blocked from crawling via robots.txt or to download the list of URLs sitting within Not Selected.
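Until Google exposes that list, you can at least check locally which URLs your own robots.txt would block for Googlebot, using Python’s standard urllib.robotparser. The rules and URLs below are hypothetical examples, not my actual site’s:

```python
# Check locally which URLs a robots.txt would block for Googlebot.
# The rules and URLs here are made-up examples for illustration.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

for url in [
    "http://www.example.com/2012/11/some-post/",
    "http://www.example.com/wp-admin/options.php",
]:
    status = "crawlable" if parser.can_fetch("Googlebot", url) else "blocked"
    print(url, status)
```

This doesn’t recover Google’s Not Selected list, of course, but it does make robots.txt debugging reproducible without waiting on Webmaster Tools.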

Ranking

Below is a graph from the Search Queries report in Google Webmaster Tools from the start of November 2012. I changed my hosting on 4 November and, as mentioned in the Indexing section above, there was a significant change in the number of Not Selected URLs on 11 November.

It is subtle, but the blue line in the graph below indicates a small uplift in impressions from around 19 November. Unfortunately I don’t have a screenshot showing the trend from before the start of November; however, it consistently showed the same curves you can see below prior to the small increase in impressions around 19 November.

Google Webmaster Tools Search Queries - www.lattimore.id.au

My personal blog doesn’t take a lot of traffic, at around 4,000-4,500 visits per month, so it isn’t going to melt any servers. Still, I thought it’d be useful to see if any of the above translated through to Google Analytics. The graph below compares organic search referrals for the 30 days from 19 November (lining up with the lift in impressions above) against the prior 30-day window. Over that time span organic visits to my personal blog grew by around 9.5%, or about 400 visits, which is nothing to sneeze at. Comparatively, traffic over the same time span has fallen by 6%-8% in each of the last three years – highlighting that 9% growth is really quite good.

Google Analytics Organic Visits - www.lattimore.id.au

Opportunity

One small personal blog does not make a foregone conclusion, but it does begin to get the cogs turning regarding opportunity. Businesses located in the USA are likely to naturally use US web hosting, so for them latency isn’t really an issue.

However, for businesses located large distances from the US, an opportunity could exist to get a small boost by either switching to US hosting or subscribing to a content delivery network with US points of presence.

A word of caution: Google uses a variety of factors to determine which audience in the world a website’s content best serves. If your website uses a country-specific domain such as .com.au, it will automatically be more relevant to users searching from within Australia. If, however, you use a global top-level domain such as .com, Google uses a variety of signals to determine which audience around the world best fits that content. Among those signals is the location of the web hosting, such that a .com on UK web hosting would be more relevant to United Kingdom users.

If your business primarily targets a country such as Australia and uses a gTLD, Google Webmaster Tools provides a mechanism to apply geotargeting to the website. Applying geotargeting to a gTLD (or one of a small number of special-case ccTLDs) causes Google to associate that website with the specified geographic area, as if it were using the corresponding country-specific domain.