Summarise text field into binary variables

For my thesis research Im doing some weblog analysis using a custom database table as log.

While I’m getting most of the info I need by just querying all the cells and putting it in an array, the database contains a text coumn in the format varchar(255) that contains internal paths of the hits on my website.

I want to see if these internal paths are a certain value such as ‘cart’ or ‘cart/checkout/review’ and put that in the datafile that I use for analysis.

The problem is that I’m grouping the hits in the mysql database on unique clients or sessions, and in the grouping process not all the path values of all the hits in a session are stored, so this doesn’t work:


// Set headings for csv file
$rows = null;
$rows[] = array('Client ID','Group ID','path','reference url','Page Views','Impression','Impression Count','Segment','User ID','Total Time Recorded','timestamp','Date','Cart','Cart Review');
// Query all access rows
$query="SELECT cid,gid,url,ref,COUNT(*),MAX(impression),SUM(impression),segment,uid,SUM(timer),timestamp,date FROM statspluslog WHERE staff =0 AND cap =1 AND hostname != 'xxx' GROUP BY cid";
$result = mysql_query($query);
while ($r = mysql_fetch_assoc($result)) {
  if ($r['url'] == 'cart') {
    $r['cart'] = 1;
  } else {
    $r['cart'] = 0;
  }
  if ($r['url'] == 'cart/checkout/review') {
    $r['cartreview'] = 1;
  } else {
    $r['cartreview'] = 0;
  }
  //print_r($r);
  $rows[] = $r;
}
//Write csv to file
$fp = fopen('statspluslog-client.csv', 'w');
foreach ($rows as $row) {
  fputcsv($fp, $row);
}
fclose($fp);

print "finished CLIENT LOG TO CSV!\
\\r";


mysql_close($conn);

Instead I reckon I should have some sort of aggregate function to see if ‘cart’ exists in the url column for any of the rows that are collapsed by the GROUP BY clause, something like this:

SELECT cid,gid,url,ref,COUNT(*),MAX(impression),SUBSTRING(“cart”,url).
Unfortunately it seems such a function does not exist, so how can I access the url column for all the columns that are collapsed by the GROUP BY clause? Or is there another way to do this?

My “Simply SQL” book by rudy discusses something called pushing down the GROUP BY clauses into a subquery but I don’t really understand this part, nor see if it can apply to my situation.

thx for any help.

no, pushing down into a subquery is not appropriate here, since you’re dealing with only one table

i’m a bit confused about one thing, your GROUP BY – apparently you’re grouping by client?

the reason i ask is because you have a number of non-aggregate columns in the SELECT clause, and unless the values of those columns are identical for all rows for each cid, then the values they have in the result will be indeterminate (which might be okay for your purposes, but that would be unusual)

as for your question about the url column, if you trying to pull out of it only a yes-or-no indicator or whether a substring like ‘cart’ exists in any of the values of each group, you can use[indent]MAX(POSITION(‘cart’ IN url))[/indent]which will be 0 only if the substring did not exist in any of the values

thx rudy, what do you mean by indeterminate? I was wondering how it selects a value for the columns that don’t use any aggregate functions, is it just the first or last value in the group?

edit:followup question: is it possible to just get the first value of the collapsed cells in a column?

indeterminate means it could be any one of the values in the group

here are two groups based on cid, with a couple of extra columns thrown in –

cid   foo   bar
  9   aaa   123
  9   bbb   456
  9   ccc   789
 37   xyz   444
 37   abc   222
 37   mmm   333
 37   lol   937

if you ran this query –

SELECT cid,foo,bar,COUNT(*)
  FROM daTable
GROUP BY cid

you would get these two rows –

cid   foo   bar  COUNT(*)
  9   ???   ???     3
 37   ???   ???     4

the question marks indicate a value that you cannot determine with reliability, although you can be sure they will be one of the values from that column in the group