

Free SEO Tips – RSS Feeds

 

------------------------------------------------------

TopRankBlog.com

Interview: Mel Carson of Microsoft Advertising on Social Media - Mon, 06 Sep 2010
Spotlight on Search Interview: Mel Carson of Microsoft Advertising on How Microsoft Does Social Media and the Yahoo Bing Search Alliance If you attend Search Marketing industry conferences, you’ve no doubt run into the ever optimistic and charming Mel Carson from Microsoft.  When I was last in London, Mel connected me with an excellent Fish [...]

Win a Free Pass to MN Blogger Conference - Thu, 02 Sep 2010
Update: Congratulations! Rebecca Flansburg has won the free pass to the 1st Minnesota Bloggers Conference for her post: “Mama wants to go MN Blogger Bad” It was close, she won by just 2 votes over Josh Braaten and Patrick Garmoe. Congratulations again Rebecca and thank you for a clever and compelling post! TopRank Online Marketing [...]

------------------------------------------------------

SEOmoz.org

Latent Dirichlet Allocation (LDA) and Google's Rankings are Remarkably Well Correlated

Posted by randfish

Last week at our annual mozinar, Ben Hendrickson gave a talk on a unique methodology for improving SEO. The reception was overwhelming - I've never previously been part of a professional event where thunderous applause broke out not once but multiple times in the midst of a speaker's remarks.

Ben Hendrickson speaking last fall at the Distilled/SEOmoz PRO Training in London
(he'll be returning this year)

I doubt I can recreate the energy and excitement of the 320-person filled room that day, but my goal in this post is to help explain the concepts of topic modeling, vector space models as they relate to information retrieval and the work we've done on LDA (Latent Dirichlet Allocation). I'll also try to explain the relationship and potential applications to the practice of SEO.

A Request: Curiously, prior to the release of this post and our research publicly, there have been a number of negative remarks and criticisms from several folks in the search community suggesting that LDA (or topic modeling in general) is definitively not used by the search engines. We think there's a lot of evidence to suggest engines do use these, but we'd be excited to see contradicting evidence presented. If you have such work, please do publish!

The Search Rankings Pie Chart

Many of us are likely familiar with the ranking factors survey SEOmoz conducts every two years (we'll have another one next year and I expect some exciting/interesting differences). Of course, we know that this aggregation of opinion is likely missing out on many factors and may over- or under-emphasize the ones it does show.

Here's an illustration I created for a presentation recently to help illustrate the major categories in the overall results:

Illustration of Ranking Factors Survey Data

This suggests that many SEOs don't ascribe much weight to on-page optimization

I myself have often felt that, from all the metrics, tests and observations of Google's ranking results, the importance of on-page factors like keyword usage or TF*IDF (explained below) is fairly small. Certainly, I've not observed many results, even in low-competition spaces, where one can simply add in a few more repetitions of the keyword, maybe toss in a few synonyms or "related searches" and improve rankings. This experience, which many SEOs I've talked to share, has led me to believe that linking signals are an overwhelming majority of how the engines order results.

But, I love to be wrong.

Some of the work we've been doing around topic modeling, specifically using a process called LDA (Latent Dirichlet Allocation), has shown some surprisingly strong results. This has made me (and I think a lot of the folks who attended Ben's talk last Tuesday) question whether it was simply a naive application of the concept of "relevancy" or "keyword usage" that gave us this biased perspective.

Why Search Engines Need Topic Modeling

Some queries are very simple - a search for "wikipedia" is non-ambiguous, straightforward and can be effectively returned by even a very basic web search engine. Other searches aren't nearly as simple. Let's look at how engines might order two results - a simple problem most of the time that can be somewhat complex depending on the situation.

Illustrations: Query for Batman; Query for Chief Wiggum; Query for Superman; Query for Pianist

For complex queries, or when relating large quantities of results with lots of content-related signals, search engines need ways to determine the intent of a particular page. Just because a page mentions the keyword 4 or 5 times in prominent places, or even mentions similar phrases/synonyms, doesn't necessarily mean that it's truly relevant to the searcher's query.

Historically, lots of SEOs have put effort into this process, so what we're doing here isn't revolutionary, and topic models, LDA included, have been around for a long time. However, no one in the field, to our knowledge, has made a topic modeling system public or compared its output with Google rankings (to help see how potentially influential these signals might be). The work Ben presented, and the really exciting bit (IMO), is in those numbers.

Term Vector Spaces & Topic Modeling

Term vector spaces, topic modeling and cosine similarity sound like tough concepts, and when Ben first mentioned them on stage, a lot of the attendees (myself included) felt a bit lost. However, Ben (along with Will Critchlow, whose Cambridge mathematics degree came in handy) helped explain these to me, and I'll do my best to replicate that here:

Simplistic Term Vector Model

In this imaginary example, every word in the English language is related to either "cat" or "dog," the only topics available. To measure whether a word is more related to "dog," we use a vector space model that creates those relationships mathematically. The illustration above does a reasonable job showing our simplistic world. Words like "bigfoot" are perfectly in the middle with no more closeness to "cat" than to "dog." But words like "canine" and "feline" are clearly closer to one than the other, and the degree of the angle in the vector model illustrates this (and gives us a number).

BTW - in an LDA vector space model, topics wouldn't have exact label associations like "dog" and "cat" but would instead be things like "the vector around the topic of dogs."

Unfortunately, I can't really visualize beyond this step, as it relies on taking the simple model above and scaling it to thousands or millions of topics, each of which would have its own dimension (and anyone who's tried knows that drawing more than 3 dimensions in a blog post is pretty hard). Using this construct, the model can compute the similarity between any word or group of words and the topics it's created. You can learn more about this from Stanford University's posting of Introduction to Information Retrieval, which has a specific section on Vector Space Models.
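To make the "angle gives us a number" idea concrete, here is a minimal sketch of cosine similarity in the two-topic cat/dog world described above. The word coordinates are made up for illustration (they are not output from any real topic model), and a real LDA model would have far more dimensions than two.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)

# Hypothetical coordinates: (weight on the "cat" axis, weight on the "dog" axis).
WORDS = {
    "feline": (0.9, 0.1),
    "canine": (0.1, 0.9),
    "bigfoot": (0.5, 0.5),
}
CAT = (1.0, 0.0)
DOG = (0.0, 1.0)

for word, vec in WORDS.items():
    to_cat = cosine_similarity(vec, CAT)
    to_dog = cosine_similarity(vec, DOG)
    print(f"{word}: cat={to_cat:.3f} dog={to_dog:.3f}")
```

With these invented numbers, "bigfoot" scores the same against both topics, while "feline" and "canine" each lean clearly toward one. The same function works unchanged on vectors with thousands of dimensions.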

Correlation of our LDA Results w/ Google.com Rankings

Over the last 10 months, Ben (with help from other SEOmoz team members) has put together a topic modeling system based on a relatively simple implementation of LDA. While it's certainly challenging to do this work, we doubt we're the first SEO-focused organization to do so, though possibly the first to make it publicly available.

When we first started this research, we didn't know what kind of impact LDA/topic modeling might have on search engine rankings. Thus, on completion, we were pretty excited (maybe even ecstatic) to see the following results:

Correlation Between Google.com Rankings and Various Single Metrics
Spearman Correlation of LDA, Linking IPs and TF*IDF

(the vertical blue bars indicate standard error in the diagram, which is relatively low thanks to the large sample set)

Using the same process we did for our release of Google vs. Bing correlation/ranking data at SMX Advanced (we posted much more detail on the process here), we've shown the Spearman correlations for a set of metrics familiar to most SEOs against some of the LDA results, including:

  • TF*IDF - the classic term weighting formula, TF*IDF measures keyword usage in a more accurate way than a more primitive metric like keyword density. In this case, we just took the TF*IDF score of the page content that appeared in Google's rankings
  • Followed IPs - this is our highest correlated single link-based metric, and shows the number of unique IP addresses hosting a website that contains a followed link to the URL. As we've shown in the past, with metrics like Page Authority (which uses machine learning to build more complex ranking models) we can do even better, but it's valuable in this context to just think and compare raw link numbers.
  • LDA Cosine - this is the score produced from the new LDAlabs tool. It measures the cosine similarity of topics between a given page or content block and the topics produced by the query.
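For readers who haven't met TF*IDF before, the sketch below shows one common textbook variant (term frequency times the log of inverse document frequency). The post doesn't say which exact weighting was used in the study, so treat the formula and the tiny corpus here as assumptions for illustration only.

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    """TF*IDF of `term` in one tokenized document, given a list of tokenized documents."""
    if not doc_tokens:
        return 0.0
    # Term frequency: share of the document's tokens that are this term.
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    # Inverse document frequency: terms found in fewer documents score higher.
    docs_with_term = sum(1 for doc in corpus if term in doc)
    if docs_with_term == 0:
        return 0.0
    idf = math.log(len(corpus) / docs_with_term)
    return tf * idf

# Made-up three-document corpus.
corpus = [
    "the rolling stones tour dates".split(),
    "the best gemstones and rubies".split(),
    "the stones played the hits".split(),
]

for term in ("the", "stones", "tour"):
    print(term, round(tf_idf(term, corpus[0], corpus), 4))
```

A word like "the" appears in every document, so its IDF is log(1) = 0 and it contributes nothing, which is the sense in which TF*IDF is less primitive than raw keyword density.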

The correlation of the LDA scores with rankings is uncanny. Certainly, it's not a perfect correlation, but that shouldn't be expected given the supposed complexity of Google's ranking algorithm and the many factors therein. But seeing LDA scores show this dramatic result made us seriously question whether there was causation at work here (and we hope to do additional research via our ranking models to attempt to show that impact). Perhaps good links are more likely to point to pages that are more "relevant" via a topic model, or some other aspect of Google's algorithm that we don't yet understand naturally biases towards these.

However, given that many SEO best practices (e.g. keywords in title tags and static URLs) have dramatically lower correlations and the same difficulties proving causation, we suspect a lot of SEO professionals will be deeply interested in trying this approach.
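As a rough illustration of what a Spearman correlation measures, the sketch below ranks two lists and then computes the ordinary (Pearson) correlation of those ranks. The positions and scores are invented, and this is not the pipeline used for the study. It only shows the statistic itself. Positions are negated so that a positive coefficient means "higher score goes with better ranking", which is an assumption about the sign convention.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical search result: ranking positions 1-5 and a made-up score per page.
positions = [1, 2, 3, 4, 5]
scores = [0.62, 0.55, 0.58, 0.30, 0.41]

# Negate positions so that position 1 counts as the "highest" ranking.
print(spearman([-p for p in positions], scores))
```

Because only the ordering matters, a metric can correlate well with rankings without its raw values meaning much on their own.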

The LDA Labs Tool Now Available; Some Recommendations for Testing & Use

We've just recently made the LDA Labs tool available. You can use this to input a word, phrase, chunk of text or an entire page's content (via the URL input box) along with a desired query (the keyword term/phrase you want to rank for) and the tool will give back a score that represents the cosine similarity in a percentage form (100% = perfect, 0% = no relationship).

LDA Topics Tool

When you use the tool, be aware of a few issues:

  • Scores Change Slightly with Each Run
    This is because, like a pollster interviewing 100 voters in a city to get a sense of the local electorate, we check a sample of the topics a content+query combo could fit with (checking every possibility would take an exceptionally long time). You can, therefore, expect the percentage output to fluctuate 1-5% each time you check a page/content block against a query.
  • Scores are for English Only
    Unfortunately, because our topics are built from a corpus of English language documents, we can't currently provide scores for non-English queries.
  • LDA isn't the Whole Picture
    Remember that while the average correlation is in the 0.33 range, we shouldn't expect scores for any given set of search results to go in precisely descending order (a correlation of 1.0 would suggest that behavior).
  • The Tool Currently Runs Against Google.com in the US only
    You should be able to see the same results the tool extracts from by using a personalization-agnostic search string like http://www.google.com/xhtml?q=my+search&pws=0
  • Using Synonyms, "Related Searches" or Wonder Wheel Suggestions May Not Help
    Term vector models are more sophisticated representations of "concepts" and "topics," so while many SEOs have long recommended using synonyms or adding "related searches" as keywords on their pages and others have suggested the importance of "topically relevant content" there haven't been great ways to measure these or show their correlation with rankings. The scores you see from the tool will be based on a much less naive interpretation of the connections between words than these classic approaches.
  • Scores are Relative (20% might not be bad)
    Don't presume that getting a 15% or a 20% is always a terrible result. If the folks ranking in the top 10 all have LDA scores in the 10-20% range, you're likely doing a reasonable job. Some queries simply won't produce results that fit remarkably well with given topics (which could be a weakness of our model or a weirdness about the query itself).
  • Our Topic Models Don't Currently Use Phrases
    Right now, the topics we construct are around single word concepts. We imagine that the search engines have probably gone above and beyond this into topic modeling that leverages multi-word phrases, too, and we hope to get there someday ourselves.
  • Keyword Spamming Might Improve Your LDA Score, But Probably Not Your Rankings
    Like anything else in the SEO world, manipulatively applying the process is probably a terrible idea. Even if this tool worked perfectly to measure keyword relevance and topic modeling in Google, it would be unwise to simply stuff 50 words over and over on your page to get the highest LDA score you could. Quality content that real people actually want to find should be the goal of SEO, and Google's almost certainly sophisticated enough to determine the difference between junk content that matches topic models and real content that real users will like (even if the tool's scoring can't do that).
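If you want to check several queries against the personalization-agnostic search string mentioned in the list above, a small helper can build the URLs for you. This is only a sketch: the endpoint and the `pws=0` parameter are taken from the example URL in this post, the function name is made up, and nothing here checks whether Google still honors that URL.

```python
from urllib.parse import urlencode

# Endpoint copied from the example URL in the post.
BASE_URL = "http://www.google.com/xhtml"

def unpersonalized_search_url(query):
    """Build a search URL with pws=0, following the pattern given in the post."""
    return BASE_URL + "?" + urlencode({"q": query, "pws": 0})

print(unpersonalized_search_url("my search"))
# prints: http://www.google.com/xhtml?q=my+search&pws=0
```

`urlencode` handles spaces and special characters in the query, so phrases like "the rolling stones" don't need to be escaped by hand.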

If you're trying to do serious SEO analysis and improvement, my suggested methodology is to build a chart something like this:

SERPs analysis of "SEO" in Google.com w/ Linkscape Metrics + LDA

Right now, you can use Keyword Difficulty's export function and then add in some of these metrics manually (though in the future, we're working towards building this type of analysis right into the web app beta).

Once you've got a chart like this, you can get a better sense of what's propping up your competitors' rankings - anchor text, domain authority, or maybe something related to topic modeling relevancy (which the LDA tool could help with).

Undoubtedly, Google's More Sophisticated than This

While the correlations are high, and the excitement around the tool both inside SEOmoz and from a lot of our members and community is equally high, this is not us "reversing the algorithm." We may have built a great tool for improving the relevancy of your pages and helping to judge whether topic modeling is another component in the rankings, but it remains to be seen if we can simply improve scores on pages and see them rise in the results.

What's exciting to us isn't that we've found a secret formula (LDA has been written about for years and vector space models have been around for decades), but that we're making a potentially valuable addition to the parts of SEO we've traditionally had little measurement around.

BTW - Thanks to Michael Cottam, who suggested the reference of research work by a number of Googlers on pLDA. There are hundreds of papers from Google and Microsoft (Bing) researchers around LDA-related topics, too, for those interested. Reading through some of these, you can see that major search engines have almost certainly built more advanced models to handle this problem. Our correlation and testing of the tool's usefulness will show whether a naive implementation can still provide value for optimizing pages.

For those who'd like to investigate more, we've made all of our raw data available here (in XLS format, though you'll need a more sophisticated model to do LDA). If you have interest in digging into this, feel free to email Ben at SEOmoz dot org.

How Do I Explain this to the Boss/Client?

The simplest method I've found is to use an analogy like:

If we want to rank well for "the rolling stones" it's probably a really good idea to use words like "Mick Jagger," "Keith Richards," and "tour dates." It's also probably not super smart to use words like "rubies," "emeralds," "gemstones," or the phrase "gathers no moss," as these might confuse search engines (and visitors) as to the topic we're covering.

This tool tries to give a best guess number about how well we're doing on this front vs. other people on the web (or sample blocks of words or content we might want to try). Hopefully, it can help us figure out when we've done something like writing about the Stones but forgetting to mention Keith Richards.

As always, we're looking forward to your feedback and results. We've already had some folks write in to us saying they used the tool to optimize the contents of some pages and have seen dramatic rankings boosts. As we know, that might not mean anything about the tool itself or the process, but it certainly has us hoping for great things.

p.s. The next step, obviously, is to produce a tool that can make recommendations on words to add or remove to help improve this score. That's certainly something we're looking into.

p.p.s. We're leaving the Labs LDA tool free for anyone to use for a while, as we'd love to hear what the community thinks of the process and want to get as broad input as possible. Future iterations may be PRO-only.



Two Quick, Simple Social Media Tips

Posted by RobOusbey

Today, I want to share two pieces of advice that are particularly useful to certain types of business - and will be exceptionally quick to implement. I've also created a free download that might help some people implement one of these ideas even more quickly.

About two years ago, I made a recommendation to a client in the UK, and I've just seen it used by a hotel in the USA. If your business offers public computers with internet access - such as those in hotel lobbies, libraries, etc - this is for you:

Tip 1: Put up a sign, next to your public computers, with a call to action; typically this could be something like 'Find us on Facebook' or 'Follow us on Twitter'.

Here's such a poster in use, at the Ledgestone Hotel in Yakima. (Click the image to embiggen.)

Sadly, it doesn't look like the Ledgestone is doing much with their Twitter account; this probably disappoints people who go to their page, and so they don't end up with as many followers as they could do. Remember - getting people to your Twitter page (or Facebook, or whatever else you're asking them to do) is only the first stage - there has to be something there for them when they arrive.

The second tip is more for people who offer wi-fi - this could be all manner of hotels, conference venues, airports, aeroplanes, train stations, coffee shops, etc. For places that offer free wi-fi, this can work even better:

Tip 2: You control the first page visitors see after logging on to your wi-fi. Don't waste this with a dull message; make the page interesting, and put some calls to action on there.

People have probably logged on to do something - but many will welcome a distraction - particularly if you keep the request brief. Create a nicely styled, but simple page, and add a couple of messages on there. Some examples could include:

  • Follow us on Twitter / Like us on Facebook: you could incentivize this, for example: if you're a coffee shop, then offer a free latte to new followers
  • Sign up to our email newsletter: this will only take them a second if you make sure the form is right there on the page, and again this can be incentivized
  • Don't forget to check in on foursquare: ideal for almost any location, and this is as good a time as any to remind them to check in
  • If you're enjoying your stay, please review us: particularly useful for hotels, where online reviews can increase visibility; I'll go into a little more detail about this below.

There can be some issues with sites noticing that a lot of people from the same IP are visiting, particularly when it comes to review services. Local search expert David Mihm advised me that he's heard Yelp in particular does try to filter out multiple reviews from the same IP, and that TripAdvisor's fraud rules do include clauses that might get you into trouble (for example, offering incentives for people to write reviews is not permitted).

I'd recommend two steps to get around this type of issue:

  1. Try to appeal for reviews only from people who already have accounts on those sites (e.g.: "If you're a Yelp member, please review us here...." or "If you have a Google account, please leave a review here...")
  2. Make this 'post-wifi-login' page available on the public internet; review sites should be able to recognize that lots of people are being referred to your page from the same URL - if it's public then they'll be able to visit that page, and should figure out what is going on.

I've built a quick free template for you to download as a starting point. You can visit the file, or download it, by clicking this link: free wifi login CTA page.

(That was created based on a template from LayoutGala; I'm not going to add any licence to it, other than: use it however you want. You should change the images that are in it to be local files at the very least.)

Honestly, it doesn't take long to print off a couple of small posters (or even to publish a nice wifi login page) so I'll hope to see social-media CTAs cropping up all over the place soon. :)



------------------------------------------------------

SEOBook.com

Labor Day = Yeah - Mon, 06 Sep 2010

When you think of Labor Day what comes to mind? For me it is these 2 thoughts:

  • lower earnings because few people are online today
  • since almost nobody is online, any hours worked today are me getting ahead of the market ;)

Working hard & working long hours can almost be a disease...the web makes it easy to be addicted.

But for every person who is putting in hard work trying to help people there is another person selling image.

The big issue with the image game is the risks. As the lies pile up they corner people into a bad situation, to where they can (and do) lose everything.

If I had to take a single point of reference to help a stranger judge the difference between a hack and someone who wants to honestly help people, I would say it is this: do they encourage you to take on debt.

  • If they do then there is a good chance they are the type of person who will go out of their way to screw you.
  • If they do not then they are likely not a maximizer type (because if they were then they would be encouraging you to go into debt to sell you more stuff).

It is not that all debt is evil (when I got started online I was naive enough to start on a credit card), but life and markets are unpredictable. If I wasn't smart enough to get a job to cover my 6 or so months of education before going full time online who knows where I would now be. What seems like a short term gain can lead to long-term failure. We are human, and so we are flawed. When you have debt/leverage you have no spare parts. So if something goes wrong you are done. Nassim Taleb spoke about the importance of savings and diversity of revenues as keys to survival, while noting that the very structure of our public markets encourages risk + leverage (options encourage short term performance & volatility rather than sustained growth, and you hope the guy on the next watch is stuck holding the Madoff ponzi bag).

The falls of past empires have typically been preceded by rapid inflation in food costs. Our food supply, like most other aspects of modern day life, has been so extended as to be poisonous. Fish soaked in chemicals literally change sex back and forth, and shrimp in the ocean (with traces of Prozac) swim toward the light - where they get eaten.

It's not about fixing the conversation. It's about filling in the blanks. If people are prone to click on something that is exactly what they will get, even if it is not something they want.

We misinform kids about sex in a way that can screw up the rest of their lives. Against the will of people data is collected so that they may be stalked and harassed. If you once thought you were fat in the past, long after becoming anorexic there will still be ads reminding you how fat you are, following you around the web.

When bits of culture die the life lessons wrapped in them fade as well. Sure there may be HTML codes for emotions, but (beyond ad targeting) it is hard to reduce people to a number.

Is the push toward homogenization to increase yield and chasing the lowest common denominator making people happier or more miserable?


I realize that reading the above can quickly make me sound like some ultra left-wing hippie, but the point of this post is not a political one ... rather one on the basic rule of law.

We justify (or downplay) harming ourselves, our environments, and the environments of other animals so we can have more and better. But to do this we often take on debt and leverage and put ourselves in precarious situations. Worse yet, we often have *others* decide to take on leverage for us, without our desire or permission.

Why is it that the government is giving Google tax credits to build more low income housing while the Federal Reserve is sitting on over $1 trillion in bad mortgage paper? How can the government want to make housing cheaper / more affordable while simultaneously propping up (and thus ensuring overvaluation of) virtually the whole of the market? How can the government taking both sides of the same bet lead to anything but waste, fraud & abuse?

If you believe in efficient market theory then banking should represent a small portion of the profit pool (since banks are all dealing in the same commodity of cash). And yet the banking class keeps representing a growing portion of the profits, while the bad sides of their trades (the losses) are passed on to tax payers.

I don't mind someone else levering up with risk so long as they have to pay the consequences of their failures. But capitalism without failure is like religion without sin.

These banks threatened tanks in the street if they didn't get their bailouts.

They went so far as to say even auditing the Federal Reserve would threaten the financial system. Sorry, um, but that is exactly what the banking class did. If they are not punished for committing crimes then the lawlessness will only grow more extreme, as it has.

When the bubble popped some of these scammers, charlatans, shysters, swindlers, and tricksters claimed that "nobody saw it coming," but in fact as things started to go wrong these folks leaned into it and made it worse.

Rather than having CDOs go unsold they engaged in self-dealing & kept mixing the bad chunks in, sorta like making new sausage out of old sausage. They knew what they were doing. They intended to commit fraud:

"On paper, the risky stuff was gone, held by new independent CDOs. In reality, however, the banks were buying their own otherwise unsellable assets."
...
"One rival investment banker says Merrill treated CDO managers the way Henry Ford treated his Model T customers: You can have any color you want, as long as it's black."

It's Labor Day. The criminal bankers who ripped you off in the past, who are currently ripping you off with more crimes, and who will rip your children off are stealing your labor. And since neither political party cares to stop it, it's up to you how much you want to give...there is no end to how much they would love to take. Time for me to take a break. ;)

Are You Thinking Like Google? - Fri, 03 Sep 2010


No, not like that, but in the good way! :D

The following is a guest post by Jim Kukral highlighting one of the most fundamental tips to succeeding online.

Have you ever really taken a step back from all the technical SEO stuff and thought about why Google wins? The real reasons why they have mass-market share and why they continue to dominate? It's time you did, because once you understand how to start thinking like Google, you can finally go beyond just ranking better and learn how to be a master Internet marketer, so you can get more sales, leads and publicity.

After all, once you've been found, you now have to convert. Otherwise, it's a waste of time.

So why does Google win? Because Google is the world's biggest, and best, problem solver. The truth is that there are only two reasons why we all go online, using Google or not. Those two reasons are:

1. To have a problem solved
2. To be entertained

That's it. Everything, and I mean everything you do online falls under one of those categories. For example, let's say you're planning on cooking your wife her favorite chicken marsala dish for your anniversary. You go online and do a search for "chicken marsala recipes". Boom, you now have recipes, and videos, and images and cookbooks and all kinds of information to help you solve your problem.

As another example, let's say you wanted to relax after work and watch your favorite musician play some of your favorite songs. You go to YouTube and do a search for "Rolling Stones Videos" and boom, you're now watching video content that entertains you.

YouTube, which is owned by Google, is already the number two most searched search engine on the Internet (behind Google of course). That means that today billions of people are actively searching the Internet for video content. That also means that because of the public's fast-growing massive hunger for content in video form, that regular people and businesses alike are now able to profit from the creation of that said video content.

The truth is, Google (and your business) has to solve problems for their (your) customers, the Internet searcher. If they (you) can't do that, they (you) lose customers. It's that black and white.

So I'll ask you again. Are you thinking like Google? Have you sat down and figured out what your target audience's biggest problems are? If you haven't done that you need to do it now. Anticipate what they need. Figure out their pain and then create products/services that take that pain away.

Just like Google.

For over 15 years, Jim Kukral has helped small businesses and large companies like Fedex, Sherwin Williams, Ernst & Young and Progressive Auto Insurance understand how to find success on the Web. Jim is the author of the book, "Attention! This Book Will Make You Money", as well as a professional speaker, blogger and Web business consultant. Find out more by visiting www.JimKukral.com. You can also follow Jim on Twitter @JimKukral.

------------------------------------------------------


Posted in SEO and FREE Website Traffic Generation.
