Just in case you missed it, Matt is back from Boston and has posted an in-depth explanation of Google’s new crawl caching proxy (the system that is responsible for Mediabot fetching pages).

Matt’s diagram pretty much matches what we’ve been seeing. I still think there might be some issues regarding robots.txt, but we’re still collecting data on that, so I’m going to wait until that’s done before commenting on it.

Another question that has popped up is whether or not Mediabot has been crawling pages that don’t have AdSense. Anyone else seeing that?

Comments

7 Responses to “Google’s Crawl Caching Proxy”

  1. Jeff Preston on April 24th, 2006 7:10 am

    I looked through the stats of several sites that I webmaster for in the US and Japan that do not have Adsense and did not see any mediabot trace.

    What do you recommend in the robots.txt file to handle the new Googlebot strategy and structure?

    Thanks.

    P.S. Nice “Web Consultant” mention in Business Week a few weeks back. I said “I know that dude” with my best Jeff Spicoli voice.

  2. Eric on April 24th, 2006 10:52 pm

    Yes… my website doesn’t have any AdSense ads, and the AdSense bot comes to my site every day.

  3. michael on April 25th, 2006 7:11 pm

    Yep, it visits several sites of mine that have never used adsense.

  4. IrishWonder on April 26th, 2006 8:32 am

    Mediabot also crawls sites that have Google related links. You are probably aware of it already, but I thought I’d mention it.

  5. Pythod on April 26th, 2006 9:36 pm

    Hello Greg,

    First post here. I posted two messages on MC’s blog earlier this week, but he removed them because my questions were very much related to BH SEO. I guess Black Hats aren’t supposed to post on his blog. Haha. Anyhow, here’s my thought on this: if they’re going to have a proxy, they had better OBEY the rules specified in my robots.txt. Out of fear (yeah, of Google banning my AdSense account), I don’t allow Googlebot to crawl my site, but I definitely allow Mediapartners-Google so it can display ads. I was the one who spoke with oilman two weeks ago, when he mentioned to you on the show that someone was blocking Google and only allowing Mediapartners. I only allow non-Googlebot bots to enter my site, including Mediapartners. I’m fine with Mediapartners doing its duty as long as it STRICTLY follows my robots.txt rules. I have a few concerns over this issue:

    If I don’t allow Googlebot in, then there’s a possibility that my site’s content might not be fetched, and AdSense needs that content to display relevant ads. Let’s consider the following scenario: Googlebot comes in, my site says, “Googlebot, I don’t let you in, but I allow Mediapartners”, so Googlebot goes back and tells the proxy that the site didn’t allow it in. Now, when I ask AdSense to display ads, AdSense will check with the proxy to find out whether the content is already there for that URL. In this case, the proxy says “no, the site denied entry”. IF AdSense relies completely upon the proxy, then it will say the same thing to me: “Sorry bud, I don’t have any ads for your page since you didn’t let me (all services/proxy) in”, and it displays public service ads. But suppose the proxy says “Hey AdSense, no, the site denied letting Googlebot in, but the robots.txt says to allow Mediapartners in, so you might want to go there again and see if you can find anything, and if you do, show them the relevant ads.” Now, as Mediapartners gets the page content, I EXPECT (and do NOT just HOPE) the proxy to mark that page as a “mediapartners-only” page and not show it in Google’s search index.

    I also asked Matt Cutts if they are going to consolidate how their user agents are identified while crawling. I read somewhere that Google UAs are now identified as “Mozilla… something”. If this is going to be the ultimate outcome, they had better give us some documentation on allowing/disallowing Google-service-specific bots, on how they plan to obey robots.txt, AND on whether or not they are going to strictly follow robots.txt for all subdomains/domains across the planet. They will save tons of bandwidth with the proxy system, but they will need more computational power, which I don’t think they have any problems with.

    I used the following robots.txt in my scenario above:

    #BOF#

    User-agent: *
    Allow: /

    User-agent: mediapartners-Google
    Allow: /

    User-agent: Googlebot
    Disallow: /

    #EOF#
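One way to sanity-check how a rule set like this is read is to run it through a parser. The sketch below uses Python's standard `urllib.robotparser`, which is only a stand-in here: Google's own matching logic may differ. The `example.com` URL, the `SomeOtherBot` agent name, and the `allowed_agents` helper are hypothetical.

```python
from urllib import robotparser

# The rules from the comment above, with the directive casing normalised.
ROBOTS_TXT = """\
User-agent: *
Allow: /

User-agent: mediapartners-Google
Allow: /

User-agent: Googlebot
Disallow: /
"""


def allowed_agents(robots_txt, url, agents):
    """Return {agent: bool} saying whether each agent may fetch url."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(
    ROBOTS_TXT,
    "http://example.com/page.html",  # hypothetical URL
    ["Googlebot", "Mediapartners-Google", "SomeOtherBot"],
)
print(result)
```

Under this particular parser, Googlebot is refused while Mediapartners-Google and the unnamed agent are allowed, which is the behaviour the rule set is meant to express.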

    What is your take on this?

    Thanks for your input and sorry for the long post.

  6. WebGuerrilla on April 26th, 2006 11:12 pm

    You aren’t the first person I’ve heard from that is worried about the scenario you describe. But I think your view of how the caching proxy works is a bit more complicated than it really is.

    The idea that the proxy will communicate to either bot whether or not a robots.txt exclusion was found probably isn’t accurate. Instead, I think it is simply a holding bin that will keep pages fetched by either bot for a certain period of time. When either bot decides it wants to revisit a page it has fetched in the past, it simply checks the cache first to see if the page it is looking for is there. If it isn’t, it makes a new request for the page.

    If that is how it works, excluding one of the two bots won’t cause a problem. If Googlebot is excluded and Mediabot checks the cache for a copy of the page, it won’t be told about the exclusion. It will just not find the page it is looking for. In that case, it fetches a new copy and everything keeps working like normal.

    The only potential problem I see happens if Googlebot checks the cache without first checking the robots.txt of the domain of the page it is looking for. Since no site running AdSense would ever exclude Mediabot, there will certainly be situations where the cache contains pages Googlebot isn’t supposed to get. If Googlebot is requesting pages from the cache without pulling a fresh robots.txt, you could end up with a situation where excluded pages end up in the database.
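To make the holding-bin idea concrete, here is a minimal sketch in Python. It is a guess at the general shape, not Google's implementation: the `HoldingBin` class, the `crawl` function, the one-hour expiry, and the `example.com` URL are all assumptions made for illustration. The point it shows is the ordering: each bot consults robots.txt for its own user agent before it looks in the shared cache.

```python
from urllib import robotparser


class HoldingBin:
    """Hypothetical shared cache: url -> (fetched_at, body), with an expiry."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self.pages = {}

    def get(self, url, now):
        entry = self.pages.get(url)
        if entry is None:
            return None
        fetched_at, body = entry
        if now - fetched_at > self.ttl_seconds:
            del self.pages[url]  # expired, treat as a miss
            return None
        return body

    def put(self, url, body, now):
        self.pages[url] = (now, body)


def crawl(agent, url, robots, cache, fetch, now):
    """Return the page body for agent, or None if robots.txt excludes it."""
    # Check robots.txt for THIS agent before touching the shared cache, so a
    # page cached by one bot is never handed to a bot that is excluded.
    if not robots.can_fetch(agent, url):
        return None
    body = cache.get(url, now)
    if body is None:
        body = fetch(url)  # cache miss: make a new request
        cache.put(url, body, now)
    return body


# A site that excludes Googlebot but says nothing about other agents.
robots = robotparser.RobotFileParser()
robots.parse(["User-agent: Googlebot", "Disallow: /"])

fetch_log = []


def fake_fetch(url):
    """Stand-in for a real HTTP request; records each fetch."""
    fetch_log.append(url)
    return "<html>page body</html>"


cache = HoldingBin(ttl_seconds=3600)
url = "http://example.com/page.html"  # hypothetical URL

media = crawl("Mediapartners-Google", url, robots, cache, fake_fetch, now=0)
google = crawl("Googlebot", url, robots, cache, fake_fetch, now=10)
media_again = crawl("Mediapartners-Google", url, robots, cache, fake_fetch, now=20)
```

In this sketch Mediabot's first visit fills the cache, Googlebot is turned away before it ever sees the cached copy, and Mediabot's second visit is served from the cache without a new request. Dropping the robots.txt check from `crawl` reproduces the problem described above: Googlebot would receive the page that Mediabot cached.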

  7. pythod on April 28th, 2006 3:00 am

    Thanks, Greg, for your fast response. Well, my idea of the proxy system could be a bit complicated, but that is because Google doesn’t provide complete documentation on how this new system works or is going to work. To digress a bit here, my BH sites rely on IPs rather than on the UA. In other words, bots are served the bot version based on their IPs and not their UA. So, if you’re a webmaster and you come across my site and you’re viewing it using Espion, you can’t see my bot version if you fake the UA. Your IP has to be in my database for you to see the bot version.

    But anyhow, my point here is that Google should strictly follow robots.txt for any service; otherwise they will have to tolerate spam that was not intended for Google, and that will mean new LLCs and new EINs for every AdSense account banned. I like Google and I’m very upfront about not polluting their SERPs, but if their system gets lamer day by day, I have no other choice.
