Query catching for fun and profit

We’ve been quite successful in capturing query strings from client calls to use as service arguments. The class we developed in Args() condensed it into a neat little package. Queries abound in HTTP, and there is more we could capture than our own service’s client data and more we could do with it all.

We want to fill a slightly different niche than the Args() object does. We’re not going to generalize the object to, for example, accept an arbitrary URI and parse out the query from it. We’re only going to look at the referrer or the page we’re currently on, in that order, for a query string.

The Query() constructor

function Query () {
  var qString = document.referrer.replace(/^[^\?]+\??/,'') ||
        document.location.search.replace(/^\?/,'');
  qString = qString.replace(/#[^#]*$/, '');
  if ( qString ) {
     this.queryString = qString;
     this.params = this.parseQuery();
  }
}

So far we’re doing almost exactly the same thing as our Args() constructor, with the exception that we’re doing it on a different target: the document.referrer if it has a query string, or else the document.location.href. Of course neither may have a query string, but that’s where we’re looking.

A change that becomes necessary at this point is to account for the possibility that there will be an anchor name present at the end of the URI. We need to strip it off when it appears. This line does it neatly.

qString = qString.replace(/#[^#]*$/, '');
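To see the two steps work outside a browser, here is a minimal sketch that applies the same regular expressions to plain strings. The extractQuery() helper and the sample URLs are assumptions made for illustration; they are not part of the Query() class.

```javascript
// Hypothetical helper: the constructor's extraction steps, applied to
// plain strings so they can be tried without a document object.
function extractQuery(referrer, search) {
  // Prefer the referrer's query string; fall back to the page's own.
  var qString = referrer.replace(/^[^\?]+\??/, '') ||
                search.replace(/^\?/, '');
  // Drop a trailing anchor name such as "#top" if one is present.
  return qString.replace(/#[^#]*$/, '');
}

var fromReferrer = extractQuery('http://www.google.com/search?q=ayn+rand#top', '');
var fromPage     = extractQuery('', '?p=failed+promise');
var nothing      = extractQuery('http://example.com/page#intro', '');

console.log(fromReferrer); // q=ayn+rand
console.log(fromPage);     // p=failed+promise
console.log(nothing === ''); // true
```

A referrer with no query string collapses to an empty string, which is what lets the || fall through to the current page.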

The parseQuery() method for Query() is identical to the one we developed for Args() with one exception. We’re no longer in control of the standard for the query strings. That means we’ll have to change one line to reflect common usage. We’ve got no choice.

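// the Args() version split on semicolons only
// var Pairs = query.split(/;/);
// the Query() version also accepts the far more common ampersand
var Pairs = query.split(/[;&]/);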
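As a rough illustration of what the wider separator buys us, the sketch below splits and decodes a query string given as a plain string. The parsePairs() name is hypothetical, and it uses decodeURIComponent() where our class uses unescape(), so treat it as a stand-in rather than the parseQuery() method itself.

```javascript
// Hypothetical stand-in for parseQuery(): works on a string argument.
function parsePairs(queryString) {
  var params = {};
  // Accept both "&" and ";" as pair separators.
  var pairs = queryString.replace(/^\?/, '').split(/[;&]/);
  for (var i = 0; i < pairs.length; i++) {
    var eq = pairs[i].indexOf('=');
    if (eq < 1) continue; // skip empty pieces and pieces without a key
    var key = pairs[i].slice(0, eq);
    // "+" means space in a query string; convert before decoding.
    var val = pairs[i].slice(eq + 1).replace(/\+/g, ' ');
    try {
      key = decodeURIComponent(key);
      val = decodeURIComponent(val);
    } catch (e) {
      // malformed escape sequence: keep whatever has not been decoded
    }
    params[key] = val;
  }
  return params;
}

var msn   = parsePairs('?FORM=MSNH&srch_type=0&q=how+to+break+a+monopoly');
var mixed = parsePairs('p=failed+promise;sm=Yahoo%21+Search');

console.log(msn.q);    // how to break a monopoly
console.log(mixed.sm); // Yahoo! Search
```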

What we want to do differently from Args()

The query string itself is easy to get from a URI, a src attribute, the document.referrer, or the document.location.href. Getting the part that we care about isn’t easy.

We don’t want all the arguments, as useful as they can be. We only care about a specific value in the whole query. The most human readable one we can find. This usually corresponds to the search parameter from a referring search engine. Anything else we find instead that is semantic is likely to be useful. The other parts of the query string will likely be internal to the referring source and uninteresting for our purposes.

For example, if Google sends us a visitor, the referring query string includes their search parameters; i.e., what they typed in the search box. That’s valuable. If we know what a user came for, we can tailor a service’s response to match. Narrowing the target of a user’s interest is what marketing companies pay great green for.

What isn’t valuable, generally, is all the little details that will be included in addition to what the seeker asked about. There might be a dozen or more extra parameters concerning location, language, encoding, page, browser, and search controls.

We have been parsing queries in a controlled situation up to now; the API for a service we’ve written. The query string variables from other sources are subject to the whims of every web developer on the planet.

Let’s take a look at referring query strings from four major search engines.

Google

?q=ayn+rand+ws+burroughs& start=0&
start=0& IQ=utf-8& oe=utf-8& client=firefox-a&
rls=org.mozilla:en-US:official

Yahoo

?p=failed+promise& sm=Yahoo%21+Search& fr=FP-tab-web-t& toggle=1

MSN

?FORM=MSNH& srch_type=0& q=how+to+break+a+monopoly

AOL

?invocationType=topsearchbox.webhome& query=problems+in+america

or

?encquery=E40D6D15F91D D0F7A80DB1C
6DD07EC79B4CB 9B7A805499BFFA6
C1FC3C4BDAB7 5311A8DB63D55A F829F304F C3B5B3981E& invocationType=keyword_rollover& ie=UTF-8

That represents 90% or better of all English search referrals. There are always sites like Naver.com to consider but if the search terms come through in English it shouldn’t matter with our technique that the referrer’s other info might be in Korean or another language.

Considering the queries above, it wouldn’t be too hard to craft a single RegExp() that would identify the right parameter key out of our pre-parsed query string.

var rx = new RegExp(/^((enc)?q(uery)?|p)$/);

That for example would catch the right part out of all the examples above. But it would fail on fringe cases often. And though we’ve probably caught 90%, why give up 10%? Also, any given site might be in a fringe that is regularly found only by search engines not represented in the top tier. We could be giving up far more than 10% with that approach.
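A quick check makes the limitation concrete. The first six keys below come from the sample referrers above; "terms" and "searchfor" are made-up keys standing in for whatever a lesser-known search engine might use.

```javascript
var rx = /^((enc)?q(uery)?|p)$/;

// Keys from the examples above, plus two hypothetical fringe keys.
var keys = ['q', 'p', 'query', 'encquery', 'start', 'FORM', 'terms', 'searchfor'];

var caught = [];
var missed = [];
for (var i = 0; i < keys.length; i++) {
  (rx.test(keys[i]) ? caught : missed).push(keys[i]);
}

console.log(caught.join(', ')); // q, p, query, encquery
console.log(missed.join(', ')); // start, FORM, terms, searchfor
```

The pattern rightly ignores "start" and "FORM", but it has no way of knowing that the two invented keys might carry exactly the search terms we are after.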

Those fringe cases

We also need to account for beasts like this.

?bpg=http %3a %2f %2fweb.ask.com %2fweb %3fq %3dwhat %2bare %2bsome %2bwords %2bthat %2bstart %2bwith %2ban %2b %2522x %2522 %2bor %2b %2522j %2522 %2bthat %2bhave %2bto %2bdo %2bwith %2bchina %26o %3d0 %26page %3d1& q=what+are+some+words+that+start+with+an+ %22x %22+or+ %22j %22+that+have+to+do+with+china& u=http %3a %2f %2ftm.wc.ask.com %2fr %3ft %3dan %26s %3da %26uid %3d23f9b94a03f9b94a0 %26sid %3d33f9b94a03f9b94a0 %26qid %3dDEF48D6FADD0A84DAB2FE5776B052796 %26io %3d4 %26sv %3dza5cb0de6 %26o %3d0 %26ask %3dwhat %2bare %2bsome %2bwords %2bthat %2bstart %2bwith %2ban %2b %2522x %2522 %2bor %2b %2522j %2522 %2bthat %2bhave %2bto %2bdo %2bwith %2bchina %26uip %3d3f9b94a0 %26en %3dte %26eo %3d-100 %26pt %3dK %2b %2b %2bThe %2bDevil's %2bDictionary %2bX %26ac %3d10 %26qs %3d0 %26pg %3d1 %26ep %3d1 %26te_par %3d103 %26te_id %3d %26u %3dhttp %3a %2f %2fsedition.com %2fddx %2fl %2fk.html& s=a& bu=http %3a %2f %2fsedition.com %2fddx %2fl%2fk.html& qte=0& o=0

Writing a RegExp() or two for those is possible. It wouldn’t be easy however and it would be easy to break with the next few cases. This is a hint that we’ve taken the wrong approach, altogether. We’ve taken the wrong approach.

A RegExp() is right out. We could never craft one to match the whimsy of every referrer out there because parameter keys are totally arbitrary.

How-to without voodoo

The clue is the realization that the keys are not reliable. Therefore the solution does not lie with them. That leaves the query values’ search terms.

While query keys are arbitrary, the values are not. They either make human sense as something that could be reasonably searched and found, like “World War II” and “Java for special education”, or they don’t, like “en” and “org.mozilla:en-US:official.”

Pervasive patterns emerge immediately and they correspond to good search strategy. You wouldn’t get good results searching for “en” on Google. Nor for “I,” “tab,” or “x.” “234234SSQWFASDF” and “” are also obviously not worth catching as search terms. If a search term is clueless, we don’t care about it anyway because if the target is nebulous, it can’t be hit.

Your best guess is better than what you think you know

The proper approach then is to iterate over the query values and apply a scoring, or weighting, system. Once we give each query value a numeric score it’s easy to choose what we’d consider the best match. It’s the value that has the highest score.

What’s probably worth catching?

Real words; book not xxxl.
Values with more than one word; blue book over blue.
Prefer longer words to shorter ones; hippopotamus over ten.

Good scoring is like good regular expressions: sometimes knowing what you don’t want is more valuable than knowing what you do.

What’s a giveaway that we don’t want it?

Short words; as, the, if, etc.
Non-word characters and punctuation; if it looks like cartoon swearing *^+$@%|&, or like _val1, we don’t want it.

With just those ideas in mind we can score quite effectively. We’ll put the scoring/weighting in one method, scoredParams(), and wrap it in a chooseBest() method that compares the weights and returns our best guess at the value in the query string that is most meaningful to a human.

First we’ll do our scoring.

Query.scoredParams()

Query.prototype.scoredParams = function () {
  var params = this.params;

  var Scored = new Object(); // Scored["value"] = numeric_score_of_value

  for ( var key in params ) {
    var val = params[key];

    var weight = 0;

    // count things that look like words, min of 3 "letters"
    var wordlikes = val.match(/(\w[-'\w]+\w\s)|(\w[-'\w]+\w)$/gi);
    weight += wordlikes ? wordlikes.length : 0;

    // we know some keys are real indicators so we'll score on them too
    if ( key.match(/^(q|p)$/) ) weight += 2;
    if ( key.match(/query|search/i) ) weight++;

    // heavily discount that which we know to be wrong
    var badChars = val.match(/[^-a-zA-Z'" ]/g);
    weight -= badChars ? ( badChars.length * 2 ) : 0;

    // value probably shouldn't talk about searches or queries
    weight -= val.match(/query|search/i) ? 1 : 0;

    Scored[val] = weight;
  }
  return Scored;
}
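To get a feel for the numbers, here is a sketch that lifts the body of the scoring loop into a standalone function and runs it over the Yahoo example from earlier. The scoreValue() name is an assumption for illustration, and the results are keyed by parameter name rather than by value only to make the output easier to read.

```javascript
// Hypothetical standalone copy of the loop body in scoredParams().
function scoreValue(key, val) {
  var weight = 0;

  // count things that look like words, min of 3 "letters"
  var wordlikes = val.match(/(\w[-'\w]+\w\s)|(\w[-'\w]+\w)$/gi);
  weight += wordlikes ? wordlikes.length : 0;

  // reward keys that are known indicators
  if ( key.match(/^(q|p)$/) ) weight += 2;
  if ( key.match(/query|search/i) ) weight++;

  // two points off for every character that doesn't belong in a phrase
  var badChars = val.match(/[^-a-zA-Z'" ]/g);
  weight -= badChars ? ( badChars.length * 2 ) : 0;

  // one point off if the value itself talks about searches or queries
  weight -= val.match(/query|search/i) ? 1 : 0;

  return weight;
}

// The Yahoo referrer from above, already parsed and decoded.
var yahoo = { p: 'failed promise', sm: 'Yahoo! Search',
              fr: 'FP-tab-web-t', toggle: '1' };

var scores = {};
for ( var key in yahoo ) scores[key] = scoreValue( key, yahoo[key] );

console.log( scores.p );      // 4   two words, plus 2 for the key "p"
console.log( scores.sm );     // -2  one word, minus 2 for "!", minus 1 for "Search"
console.log( scores.fr );     // 1   counts as a single hyphenated "word"
console.log( scores.toggle ); // -2  no words, one bad character
```

Only “failed promise” comes out clearly ahead, and it does so on the strength of both its key and its value.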

There are many refinements we might make. We could score down for “words” that don’t contain /[aeiouy]/i for example. It’s pretty good as is though as we demonstrate below.
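The vowel idea might look something like the following. This is only a sketch of one possible refinement: the vowellessWords() helper is hypothetical, and how much weight to subtract for each hit is left open.

```javascript
// Hypothetical helper: counts space-separated tokens that contain
// letters but no vowel (treating "y" as a vowel).
function vowellessWords(val) {
  var tokens = val.split(/\s+/);
  var count = 0;
  for ( var i = 0; i < tokens.length; i++ ) {
    var hasLetters = /[a-z]/i.test( tokens[i] );
    var hasVowel   = /[aeiouy]/i.test( tokens[i] );
    if ( hasLetters && ! hasVowel ) count++;
  }
  return count;
}

console.log( vowellessWords('xxxl book') ); // 1
console.log( vowellessWords('blue book') ); // 0
console.log( vowellessWords('mp3 dvd') );   // 2
```

Inside the scoring loop, the count could simply be subtracted from the weight, in the same way the bad characters are.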

Now we need to get at the best guess out of the scored list.

Query.chooseBest()

Query.prototype.chooseBest = function () {
  // build an associative array of terms to scores
  var scoredParams = this.scoredParams();

  var max = 0;
  var choice = '';
  for ( var qVal in scoredParams ) {
    if ( scoredParams[qVal] > max ) {
       choice = qVal;
       max = scoredParams[qVal];
     }
  }
  return choice;
}
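The selection step can be tried on its own with a hand-built table of scores. In this sketch the pickBest() name and the sample scores are assumptions for illustration; the loop is the same one chooseBest() uses.

```javascript
// Hypothetical standalone version of the loop in chooseBest().
function pickBest(scored) {
  var max = 0;
  var choice = '';
  for ( var qVal in scored ) {
    // only a score above zero can win
    if ( scored[qVal] > max ) {
      choice = qVal;
      max = scored[qVal];
    }
  }
  return choice;
}

var best = pickBest({ 'failed promise': 4, 'Yahoo! Search': -2,
                      'FP-tab-web-t': 1, '1': -2 });
var none = pickBest({ 'en': 0, 'utf-8': -1 });

console.log( best );        // failed promise
console.log( none === '' ); // true
```

Starting max at 0 is a design choice worth noticing: a query whose values all score zero or below yields an empty string instead of a bad guess.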

The class is all done. Time to try it out against some real life examples.

Our query digging script, referQ.js

// ALL the Query() code goes here; omitted to save space

// Now we can construct objects
var query = new Query();

var params = query.params;

document.write("<ul>");

for ( var key in params ) {
  var correctedForSpace = params[key].replace(/\s/g, '+');
  correctedForSpace = correctedForSpace.split('&').join('&amp; ');
  document.write( li( key + ' --&gt; ' + 
                     b( correctedForSpace )
                     )
                     );
}

document.write("</ul>");

document.write("The winner is: ");

var choice = query.chooseBest() ?
  '<b>' + query.chooseBest() + '</b>' : '<i>none found</i>';

document.write( choice );

// little extra something to get list items tags easily
function li (str) { return '<li style="font-size:x-small">' + str + '</li>' }

// little extra something to get bold tags easily
function b (str) { return '<b class="breakable">' + str + '</b>' }

Live output of referQ.js

Try the links to update the output

NB: the demo script is tweaked to show only the current page’s query string, not the referrer’s. If we were to use the referrer, it would be confusing because your choice would show up not when you clicked it but on the successive page load, at which point it would have become the referrer.

Alternative cat skinning techniques

This sort of query catching can be done up front with vanilla CGI but it means you have to have that program executing every page load or wrapped around your entire page logic. It can also be done behind the scenes at the Apache/webserver level with modules or tools hooked directly into the webserver like .

These are computationally expensive compared with JavaScript however, and except in the case of or Apache modules in general, they will execute more slowly for the user. They’re also more difficult or inaccessible in general. Unless you are running your own server, or paying $100/month for a dedicated one, you probably will not be allowed to add custom modules.

One last look at all of it together.

QueryObj.js compleat

// QUERY object library -------------------------------------------

// constructor ------------------------
function Query () {
  var qString = document.referrer.replace(/^[^\?]+\??/,'') ||
        document.location.search.replace(/^\?/,'');
  qString = qString.replace(/#[^#]*$/, '');
  if ( qString ) {
     this.queryString = qString;
     this.params = this.parseQuery();
  }
}

// -----------------------------------
Query.prototype.parseQuery = function () {  
  var Params = new Object();
  if ( ! this.queryString ) return Params;
  var Pairs = this.queryString.split(/[&;]/);

  for ( var i = 0; i < Pairs.length; i++ ) {
    var KeyVal = Pairs[i].split('=');
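    // skip anything that isn't exactly one key and one value
    if ( KeyVal.length != 2 ) continue;
    if ( ! ( KeyVal[0] || KeyVal[1] ) ) continue;
    var key = unescape( KeyVal[0] );
    // turn "+" into spaces before unescaping so an encoded %2B stays a plus
    var val = unescape( KeyVal[1].replace(/\+/g, ' ') );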
    Params[key] = val;
  }
  return Params;
}

// -----------------------------------
Query.prototype.chooseBest = function () {
  // build an associative array of terms to scores
  var scoredParams = this.scoredParams();

  var max = 0;
  var choice = '';
  for ( var qVal in scoredParams ) {
    if ( scoredParams[qVal] > max ) {
       choice = qVal;
       max = scoredParams[qVal];
     }
  }
  return choice;
}

// -----------------------------------
Query.prototype.scoredParams = function () {
  var params = this.params;

  var Scored = new Object(); // Scored["value"] = numeric_score_of_value

  for ( var key in params ) {
    var val = params[key];

    var weight = 0;

    // count things that look like words, min of 3 "letters"
    var wordlikes = val.match(/(\w[-'\w]+\w\s)|(\w[-'\w]+\w)$/gi);
    weight += wordlikes ? wordlikes.length : 0;

    // we know some keys are real indicators so we'll score on them too
    if ( key.match(/^(q|p)$/) ) weight += 2;
    if ( key.match(/query|search/i) ) weight++;

    // heavily discount that which we know to be wrong
    var badChars = val.match(/[^-a-zA-Z'" ]/g);
    weight -= badChars ? ( badChars.length * 2 ) : 0;

    // value probably shouldn't talk about searches or queries
    weight -= val.match(/query|search/i) ? 1 : 0;

    Scored[val] = weight;
  }
  return Scored;
}

It’s time to turn our focus to improving the management of all the code we’ve been developing. Perhaps there is a way we can reuse the code above without having to put it into every script we’ve got. We’re ready to take a stab at Using JS code libraries.
