Sorting by String match relevancy in List

This is a continuation of this thread about Storing an Index in SQL Server.

I have decided that I am going to use the code that r937 has provided to give me a SQL table containing the following.


ID        URL            CONTENT
------------------------------------------------------------
1         a.com          it is what it is
2         b.com          what is it
3         c.com          it is a banana

As I’ve said in the previous thread I want to run a search for the term “what is it”. In doing so, I would like the rows to be sorted by relevancy, meaning that every row would be included, but row 2 would be at the top because it contains the whole term “what is it”, whereas the other two contain just the words/some of the words.

In order to do this I am going to store these contents in a new class, like so:


   public class MyObject {
       public int id { get; set; }
       public string url { get; set; }
       public string content  { get; set; }
   }

What I would like to do is create a List of type MyObject (List<MyObject>) and then sort it in the way I require (by relevancy to a string).

How would I go about doing this in C#?

Does this work?

			List<testList> testCase = new List<testList>();

			testCase.Add(new testList { Id = 1, Url = "a.com", Content = "it is what it is" });
			testCase.Add(new testList { Id = 2, Url = "b.com", Content = "what is it" });
			testCase.Add(new testList { Id = 3, Url = "c.com", Content = "it is a banana" });
			var xTest = (from t in testCase
						   orderby t.Content.IndexOf("what is it") descending
						   select t);


			foreach (var item in xTest)
			{
				Console.WriteLine("{0} {1} {2}", item.Id, item.Url, item.Content);
			}

What imaginekitty suggested should work to a certain extent, but this is a difficult solution to solve as there is a lot more than just that.

All the items that do not have the exact term in will be -1. Then there items that do contain that exact text. The content with that phrase at the end of the sentence will be at the top as it will have an index of greater than 0 where as the content with that phrase at the beginning will have an index of 0, thus being less then the former.

Idealy waht you want to do, is work out some sort of algorithim that cacluates how close the content is to a certain phrase. 10% 100%, 54%. Then you can order by that.

But coding said algorithim is where the complexity is going to come in as there are a few things you will need to check to compute this. You can try googling for help on that, but I am not sure if you will find results. Best would be to find a good starting point and then just run with it. Will be a lot of trial and error

Yeah, there won’t be any real meaning to their order other than the items that match will be above items that don’t. Pretty useless, huh? :slight_smile:

Yes, exactly why I said your code will work to a certain extent. lol. Yes, it will achieve exactly that. But if he wants to it to be exact he will need the algorithm. So “what is” will be above “what” as it is more of a match.

So it depends on what ULTiMATE wants from this. If that is all he wants to achieve and doesnt make that it doesnt matter where it appears in the content, it will appear on the top, then thats perfect.

It all comes down to the desire level on accuracy.

What you’ve described there is the exact level of complexity I’d like; I hadn’t even thought of using a percentage gauge to do it so I’ll definitely be using that. The algorithm itself is where I’m falling short and for the life of me I cannot think of an easy way to do it. Any advice on how I could write such an algorithm would be much appreciated.

I tried to edit my last post but for some reason it wouldn’t work, so here’s a new reply…

I looked over the problem a bit more and realised that my inverted index was the solution to the problem. Using the previous example used here and in the thread I posted in the Databases forum I have an inverted index like so:


"a":      {(2, 2)}
"banana": {(2, 3)}
"is":     {(0, 1), (0, 4), (1, 1), (2, 1)}
"it":     {(0, 0), (0, 3), (1, 2), (2, 0)} 
"what":   {(0, 2), (1, 0)}

The best way I can do what I want is to have some kind of score field to order these rows, preferably one I can generate through C#. The best way I can do what I want is to measure the distance between the words used. For example, if I were to search for “what is it”, I can check the inverted index and will find that the first word (what) can be found in the second row and is the first word (1, 0), from there I check the second word (is) and see that it is also in the second row and is the second word (1, 1), finally it is the second word in the same row (1, 2), meaning that the two have been found together.

What I need to do is write up a C# script for such a solution, and that’s where I’m struggling a bit. Any help on this would be greatly appreciated. Any queries and code already used can be found on the previous thread, linked in the first post.

Why not just use Sql Server Fulltext indexing on the table? That has all sorts of neat stuff to do exactly what you are looking for.

Or Lucene.Net

Full Text Search isn’t an option as I will be using the finished data structure within another and the SQL Server option doesn’t offer the flexibility that I require.

I may also be switching some parts of the application to another database and OS (probably using Oracle, although the DBA) and I don’t wish to be tied down to SQL Server. This part of the application is a tiny part of the finished product and the other factors rely on a software solution rather than a pre-made database one.

I want the beginning data stored by database, but factors defined within the software as a number of classes will need to manipulate that data.

I was going to use Lucene to handle the indexing part of my system and toyed with some Java code to begin with to try what I want with HashMaps, but I found it to be overkill for the solution that I need. If I were to continue using it I’d probably end up forking it and having to rewrite a ton of stuff to handle the image data I’d be taking in as input. I’ve been running through the book Programming Collective Intelligence as a guide for much of this stuff and what it’s shown in Python is very close to what I require, but as I don’t know Python I’ve attempted in C#.

Besides, how often does one get to work on their own inverted index structure? :smiley:

Ok, that makes sense. I’d run with Lucene myself, why reinvent the wheel?

Regardless of whether I use Lucene or not I’ll have to pass the results into its own object to factor in the other meta-data. Lucene is good if you need an efficient index that you can search, but search is one of very few aspects of this program and eventually won’t be what the user will need to do as it would just end up buried within more of my code and wouldn’t be an efficient choice as it’d just end up waiting for what statistical information and metadata are passed through anyway. It’d be like buying a Bugatti Veyron when you just need something to go down the road.

I might just try using the Python code within Programming Collective intelligence and map it to C# as it is exactly what I need anyway. Thanks for the help thus far!