May 27 2007

Blog search engines & blog popularity

Published by admin at 8:58 am under blogging

Both Google and Technorati have recently released new search engines for blogs. The Technorati version seems to return more up-to-date results, but Google’s has some interesting options such as “email alert”, “add blog search gadget” and “subscribe to blog search feed”.

Why does Technorati list the “most popular blogs” when what it really means is the “most linked-to blogs” or the “most favorited blogs”? Steve Pavlina’s blog ranks 4252 on Alexa - way higher than many on the Technorati top 10 - but it doesn’t even make Technorati’s top 100.

Why should I only care about which blogs other bloggers read? Does no-one compile a list of the blogs with the most traffic?

5 Responses to “Blog search engines & blog popularity”

  1. Jim Sturgeson 27 May 2007 at 11:50 pm

    The problem is simply part of the nature of the web. Webservers don’t hand off hit counts to browsers or web crawlers. So that information just isn’t available to the search engines.

    If blogging software did hand over hit counts in the headers, we might have something (I notice that several of the sites in your blogroll are using WordPress, and it certainly does not hand off hit counts).

    For example, this site’s meta tags tell me what software generated the page, and that you’re using a normal character set. That’s it.
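To make that concrete, here is a minimal sketch of what a crawler can actually read from a page's meta tags, using Python's standard `html.parser`. The sample page and the `MetaTagCollector` / `extract_meta` names are made up for illustration; the point is only that nothing resembling a hit count appears unless the site chooses to publish one.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect <meta> tags keyed by their name (or http-equiv) attribute."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = attributes.get("name") or attributes.get("http-equiv")
        if key:
            self.meta[key.lower()] = attributes.get("content") or ""


def extract_meta(html_text):
    """Return a dict of the meta information found in an HTML document."""
    parser = MetaTagCollector()
    parser.feed(html_text)
    return parser.meta


# Hypothetical page: it names its generator and character set, nothing more.
SAMPLE_PAGE = """
<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="generator" content="WordPress 2.2">
<title>Example blog</title>
</head><body>Hello</body></html>
"""

meta = extract_meta(SAMPLE_PAGE)
print(meta)
print("hits" in meta)  # no traffic figure is exposed to the crawler
```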

    If I’m a web crawler I can cache every page in the world and then record who links to who (this was Google’s advantage over the competition). As a web crawler you have no way of directly finding out what links are coming in or how many there are; only the webserver knows that if it even bothers to record it. Google just “brute forced” the problem by caching *everything* then working backwards. It’s probably a safe assumption that that’s how they rate blogs too, which would explain why you get the results you describe above.
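The "working backwards" step can be sketched as inverting a table of outgoing links. This is only an illustration of the idea, not how any real search engine is implemented; the `CRAWLED` data and the `.example` domains are invented.

```python
from collections import defaultdict


def count_inbound_links(outgoing_links):
    """Invert a page -> linked-pages mapping into inbound link counts.

    Each linking page is counted once per target, and self-links are ignored.
    """
    inbound = defaultdict(set)
    for source, targets in outgoing_links.items():
        for target in targets:
            if target != source:
                inbound[target].add(source)
    return {page: len(sources) for page, sources in inbound.items()}


# Hypothetical crawl results: what each cached page links out to.
CRAWLED = {
    "blog-a.example": ["blog-b.example", "blog-c.example"],
    "blog-b.example": ["blog-c.example"],
    "blog-c.example": ["blog-a.example", "blog-c.example"],
    "blog-d.example": ["blog-c.example", "blog-c.example"],
}

counts = count_inbound_links(CRAWLED)
# Most linked-to first; ties broken alphabetically.
ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
print(ranked)
```

Note that a page nobody links to (here `blog-d.example`) never shows up in the counts at all, however much traffic it gets.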

    Of the 5 sites I clicked on at random in your “sites” list, only one even gave a meta tag with a description for the crawlers to capture for refining searches.

    It’d be nice if you could trust bloggers to put a tag like , but bloggers would just have the server send out an inflated number to improve their rankings, making the whole scheme worthless. Just like the junk link-farm pages you used to get on AltaVista or the original Yahoo search (in those engines, sites were weighted with how many outgoing links connected to other highly rated sites). Google made link farms useless because it went the other way - a site got ranking based on how many other sites linked to it.

    So, we’re kinda stuck with what we’ve got. Something like digg could work (though I don’t think it is) if readers were recording the value of a blog with a central service as they read away. But, digg only rates the popularity of individual posts, not blogs. And I wasn’t impressed with what it thought was worth reading.

    Unfortunately, there is no way to get the data unless you ask the bloggers themselves. At least that’s the way the web is now. And even if they did hand it over, we wouldn’t trust their data as long as they are the ones generating the numbers.

    Regards,
    Jim

  2. Jim Sturgeson 27 May 2007 at 11:54 pm

    The blogging software ate my meta tag description because meta tags are HTML and they don’t actually show up. Whoops!
    The tag would have looked like this: {meta name=”hits” content=”4123”}

  3. peter on 28 May 2007 at 4:07 pm

    Jim - I’m not following you. Alexa rates websites and treats blogs just like any other website. So if I know the URL of a blog then I can go to Alexa and find out how much traffic that blog gets.

    Google and Technorati have lists/databases of blogs that they use for their blog search. Taking those lists and matching them up with the rankings from Alexa, it should be easy to get a real blog popularity ranking.
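A rough sketch of that matching step, assuming both lists are already in hand: `BLOG_INDEX` stands in for a blog search engine's list of blogs and `TRAFFIC_RANK` for a table of traffic ranks (1 = most traffic). Both are invented placeholders; no real Alexa or Technorati interface is being called here.

```python
def rank_blogs_by_traffic(blog_urls, traffic_rank):
    """Order known blogs by traffic rank and list the ones with no data.

    Returns (ranked, unranked): blogs sorted with the lowest rank number
    (most traffic) first, and blogs missing from the traffic table.
    """
    ranked = [url for url in blog_urls if url in traffic_rank]
    unranked = [url for url in blog_urls if url not in traffic_rank]
    ranked.sort(key=lambda url: traffic_rank[url])
    return ranked, unranked


# Hypothetical inputs for illustration only.
BLOG_INDEX = ["blog-a.example", "blog-b.example", "blog-c.example", "blog-d.example"]
TRAFFIC_RANK = {
    "shop.example": 12,        # not a blog, so it is filtered out
    "blog-a.example": 1200,
    "blog-b.example": 350,
    "blog-c.example": 48000,
}

ranked, unranked = rank_blogs_by_traffic(BLOG_INDEX, TRAFFIC_RANK)
print(ranked)
print(unranked)
```

The join itself is trivial; the result is only as good as the traffic table behind it.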

    No?

  4. Jim Sturgeson 28 May 2007 at 8:28 pm

    Peter,

    I know for sure that there is no way for a crawler to ask, say, the shootingbynumbers webserver and get an answer as to how many hits the site and/or a post got. I just went over to w3.org and double-checked to be sure that there was nothing new in the HTTP standards.

    However, I did a little research and have the answers about these search engines:

    Technorati: ranks pages based on incoming links: that’s it. The more links coming into your page, the higher your ranking. They appear to limit the pages that they index to those who are sending “ping” updates to their ping server, and those who go on the site and “claim” their blog.

    Alexa: aggregates data returned from the Alexa toolbar. The data lets them figure out what traffic went where based on Alexa toolbar users. Most people I know don’t have an Alexa toolbar. However, if only smart people were using the toolbar, then you’d expect their answers to be pretty good. I have no idea how the Alexa demographics break down; on their site they say that they don’t even know, though there are several million toolbars in use.
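As a back-of-the-envelope illustration of why panel size matters, the sketch below estimates what fraction of a toolbar panel visited a site and attaches the usual binomial standard error. It assumes the panel is a random sample of web users, which (as noted above) nobody can confirm for the real toolbar population; `estimate_reach` and the numbers are invented.

```python
import math


def estimate_reach(visitors_in_panel, panel_size):
    """Estimate the share of users visiting a site from a panel sample.

    Returns (share, standard_error) using the binomial approximation
    sqrt(p * (1 - p) / n). Assumes the panel is a random sample.
    """
    if panel_size <= 0:
        raise ValueError("panel_size must be positive")
    if not 0 <= visitors_in_panel <= panel_size:
        raise ValueError("visitors_in_panel must be between 0 and panel_size")
    share = visitors_in_panel / panel_size
    standard_error = math.sqrt(share * (1 - share) / panel_size)
    return share, standard_error


# Hypothetical figures: the same 1% share measured on two panel sizes.
small_share, small_error = estimate_reach(10, 1_000)
large_share, large_error = estimate_reach(10_000, 1_000_000)
print(small_share, small_error)
print(large_share, large_error)
```

With the same underlying share, the smaller panel's error is far larger relative to the estimate, so the ordering of low-traffic sites gets noisy quickly.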

    Google Blog: they have a sort of hybrid thing going on. First, the blog search only indexes sites with a syndication feed. Next, it uses incoming links to rank a page: more links, higher rank. But - it also weights those incoming links based on the ranking of the page from which the incoming link came. I believe Google is also using ping data (notifications from your website that it just updated the site) to help with ranking, though they don’t come out and say it. Since they also run Blogger.com, they can directly measure the traffic on those blogs, though they don’t say that it’s used in the blog search.
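The "weight each incoming link by the rank of the page it came from" idea can be sketched with a simplified PageRank-style iteration. This is a textbook approximation, not Google's actual blog search ranking; the graph is invented, and the damping factor of 0.85 is just the conventional choice.

```python
def weighted_link_rank(outgoing_links, damping=0.85, iterations=50):
    """Score pages so that links from high-scoring pages count for more."""
    pages = set(outgoing_links)
    for targets in outgoing_links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            # De-duplicate targets and ignore self-links.
            targets = set(outgoing_links.get(page, [])) - {page}
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical graph: "x" and "y" each have exactly one incoming link,
# but x's link comes from a heavily linked hub and y's does not.
GRAPH = {
    "a.example": ["hub.example"],
    "b.example": ["hub.example"],
    "c.example": ["hub.example"],
    "hub.example": ["x.example"],
    "d.example": ["y.example"],
}

scores = weighted_link_rank(GRAPH)
print(scores["x.example"] > scores["y.example"])
```

A plain inbound-link count would score `x.example` and `y.example` the same; the weighting is what separates them.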

    I found one site that looked like an aggregator that was supposed to be combining data from other search engines, but the user interface was so terrible, I went my merry way. I suppose that a well-implemented aggregator might provide the best results, but I don’t know of any.

    A fun fact: Google did a survey of a billion web pages and looked to see what “tag” information of any kind came back. From how they put it, web tag use globally is a Frankenstein’s monster. There were 19 commonly used elements like: header, title, table, list item, and so forth. But they also reported a vast number of tags that aren’t supported by any browsers, improperly formatted tags, and things that were made into tags because of ignorance - nothing that a web browser would look at.

    How will we ever sort through all the junk? It will take a supposedly unbiased entity (as Google claims to be) to come up with a good way.

    I suppose that Alexa would best answer your question as to who is getting the most traffic, but unfortunately the number of clickstreams they are tracking is way too small. They are, however, the only ones I know of who are compiling actual traffic data.

    Jim

  5. peter on 29 May 2007 at 9:18 am

    Thanks for taking the time to explain Jim.
