+44 (0)20 8744 4350

Pollen Euclidean Searching

image

Euclidean Distance comparison theory

According to Wikipedia, Euclidean Distances are;

 

“In mathematics, the Euclidean distance or Euclidean metric is the "ordinary" distance between two points that one would measure with a ruler, and is given by the Pythagorean formula. By using this formula as distance, Euclidean space (or even any inner product space) becomes a metric space. The associated norm is called the Euclidean norm. Older literature refers to the metric as Pythagorean metric.”

 

This essentially means that we can compare different variables on a search but by using Euclidean methods we can essentially compare different criteria against a search term and by calculating the theoretical “distance” between them we can calculate a best match accordingly.

From a programming point of view we would need to do the following to implement a Euclidean search.

 

We have five dimensions.

 

Object

Skill one

Skill two

Skill three

Distance from a point

Skill four

Search term

1

4

5

10 miles

8

Candidate

3

4

5

12 miles

5

 

To calculate the Euclidean distance between the search term and the candidate you would do the following:

 

Square root( (1-3) 2 + (4-4)2 + (5-5 )2 +(10 -12)2 +  (8-5)2 ) =

= square root (4 + 0 + 0 + 4 + 9  ) =  4.13

 

The metric 4.13 shows that the relative distance between the “search term” and the candidate.  With this metric we have a broad analysis in which we can retrieve the candidates relevant to the search metric.

The above would have to be repeated across all of the candidates within the system to come up with a way of ordering the candidates so that the candidates which most fit the criteria are highlighted to the user.

With the above we can see that even though the metrics are different by using this method we do have a way of comparing and ordering results on best match.

With the above though we can see that potentially we will have performance issues if we are going to have to run these mathematical methods on all of the results.  
Broadly speaking there are potentially three methods of dealing with the above. You could either:

 

  1. Use the server side language it retrieve the data , and then do all of the mathematics on all of the result set, and then retrieve the data ordering by the calculated “metric”
  2. Use the database itself it to do the mathematics
  3. Or use a third party such as the Lucene family of implementations to do the mathematics and querying of the database.

 

Each of the above has pro’s and cons and these are described below.

 

Using server side language

To do this you would have to retrieve the entire dataset from the database, and then iterate through the result set doing the calculations.

Once the calculations have been done we would have an array of data with the distances and the relevant id, we would then order by the distance and then retrieve a number of candidates ordered by the distance with the shorter distance being the most relevant.

The main issue is that there could potentially memory issues as the server side code will need to retrieve and then hold within its memory the dataset whilst the calculations are conducted. 

On a small dataset this wouldn’t be an issue, but obviously with a large dataset we could run into major issues.

 

Using Database

Database servers are designed to retrieve and manipulate data sets and by using SQL we can do all of the querying and calculating within the database. For example the query shown below:

 

Select * from TableName

Order by

POW((TableName.X - x), 2) + POW((TableName.Y - y), 2) + POW((TableName.Z - z), 2) asc limit 0, 30

 

The above query would be able to do all of the necessary calculations and return the 30 most relevant. 

By placing the load on the database server and optimising the queries we could ensure that the load is passed on from the serverside code ensuring that the main server remains responsive.

 

Using third party tool based on lucene

Lucene is “Apache Lucene is a freely available information retrieval software library that works with fields of text within document files.”

Essentially lucene works by creating an xml indexed data set which has been optimised to ensure that the data set can be queried efficiently. There are a number of implementations, open source and proprietary based systems based on the original open source project.

In the case doing Euclidean distance search from research, the Solr project, has a number of mathematical functions which we could use to facilitate the required functionality. As we would be using inbuilt functionality of the Api it could be relatively easily to add the Euclidean metric to the search.

By essentially passing over the search to a lucene server, the load will again be dramatically decreased as we would be using a system developed specifically for “searching”. In addition the relevancy searching that can index text across the entire database would be an added bonus.

 

Conclusion

From this top level research we can implement Euclidian distance metric searching, and there are variety of ways of implementing this.

Our recommendations for moving forward on this is too confirm that this is indeed the functionality that is required. Once confirm we should create a prototype to determine which method will provide the most efficient retrieval of results to the end user.

 

Del Langrish