What is the ideal database that allows fast cosine distance?

I'm currently trying to store many feature vectors in a database so that, upon request, I can compare an incoming feature vector against many others (if not all) stored in the DB. I would need to compute the cosine distance and return only, say, the 10 closest matches. Each vector will have a size of ~1000 or so.

Every request will carry a feature vector that needs to be compared against all feature vectors belonging to a subset within the DB (most likely on the order of thousands of entries per subset in the worst case).

Which database offers the flexibility to run such a query efficiently?

I looked into Postgres, but I was wondering whether there are alternatives that fit this problem better. Not sure it matters much, but I'm most likely going to be using Python.

I found this article about doing it in SQL.

EDIT: I am open to alternative solutions for this problem that are not necessarily tied to SQL.
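To make the requirement concrete, here is roughly the per-request computation I have in mind: a brute-force top-k cosine scan in plain NumPy (shapes and names are only illustrative):

    import numpy as np

    def top_k_cosine(query, stored, k=10):
        """Cosine distance of `query` against every row of `stored`, closest first."""
        q = query / np.linalg.norm(query)
        m = stored / np.linalg.norm(stored, axis=1, keepdims=True)
        sims = m @ q                      # one dot product per stored vector
        top = np.argsort(-sims)[:k]       # indices of the k most similar rows
        return top, 1.0 - sims[top]       # cosine distance = 1 - cosine similarity

    # e.g. a subset of a few thousand stored vectors, each ~1000-dimensional
    subset = np.random.rand(5000, 1000)
    idx, dists = top_k_cosine(np.random.rand(1000), subset)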










feature-extraction databases






asked Feb 13 at 19:18 by G4bri3l, edited Feb 13 at 19:30







  • How large is your dataset? Is it possible to precompute the cosine similarity between every vector and store that instead (or in addition)? – Wes, Feb 13 at 19:23

  • I will edit the question; the database won't be too big (we're talking about thousands of entries). The problem is that I don't care about the similarity of the stored vectors to one another: I will submit a request with an unseen feature vector and will have to compute its similarity against a subset of feature vectors already in the DB and return the closest ones. Not sure I explained myself as well as I think, so let me know if it's still not clear. – G4bri3l, Feb 13 at 19:27













3 Answers

Answer (score 2) – answered Feb 13 at 19:47 by Wes

If it's only a few thousand entries, each with 1,000 features, you may just be able to keep it all in RAM if you are running this on some kind of server. Then, when you get a new feature vector, just run a cosine similarity routine over it. An easy way to do this is to use something standard like pandas and scikit-learn.

Alternatively, you can keep everything in SQL, load it into something like pandas, and use scikit-learn.

I'm actually not sure you'll get much of a speed-up, if any, by writing the computation in SQL itself.
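A minimal sketch of the in-memory approach, assuming a long-lived server process; the random DataFrame is a stand-in for whatever you actually load once at startup (e.g. via pd.read_sql):

    import numpy as np
    import pandas as pd
    from sklearn.metrics.pairwise import cosine_similarity

    # Stand-in for the real data; load it once at startup and keep it resident.
    # 10,000 x 1,000 float64 values is only ~80 MB, so RAM is rarely the problem.
    vectors = pd.DataFrame(np.random.rand(10_000, 1_000))

    def closest(query_vector, subset_index, k=10):
        subset = vectors.loc[subset_index]                     # the relevant subset
        sims = cosine_similarity([query_vector], subset.values)[0]
        top = sims.argsort()[::-1][:k]                         # k most similar rows
        return subset.index[top].tolist(), sims[top]

    ids, scores = closest(np.random.rand(1_000), subset_index=range(2_000))

For subsets of a few thousand rows this is a single matrix-vector product, so a brute-force scan per request is usually fast enough.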






  • Nice, thank you. I thought for some reason there existed a DB implementation optimized for this kind of task. Now, out of curiosity, what if the entries grow by an order of magnitude? – G4bri3l, Feb 13 at 20:06

  • I think you just have to try a couple of options and benchmark for your use case. If you have double-precision numbers and 10,000 vectors with 1,000 features each, that's about 80 MB plus overhead for whatever data structure you might be using (like a pandas DataFrame). That should be doable to store in RAM on most systems. But that use case is for a server where you load it once at server startup. – Wes, Feb 13 at 20:46

  • Oh nice, loading it once at startup sounds like the way to go; I think that can also help share the allocated memory across requests. So I load the vectors at startup, they can be accessed at any time from any request, and at that point it's mostly computation. I think I have a pretty good idea of how to limit memory usage. Thanks for the inspiration! – G4bri3l, Feb 13 at 21:04

  • Please be sure to mark this as the accepted answer if it solved your problem. – Wes, Feb 19 at 17:03


















Answer (score 1) – answered Feb 13 at 19:30 by I_Play_With_Data

  1. Is there a reason why you need to do this in SQL? Most architecture patterns would advise against keeping formulas and logic in the database layer. Why not create another layer, outside the database, with a language that can do the computations you need? (See the sketch below.)

  2. You can also do the calculations ahead of time and store them in a cached lookup table in the database: do all the computations you need, import the results into your database, and then just run standard SQL SELECT statements to pull them at run time.
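A minimal sketch of the first option, assuming a hypothetical Postgres table feature_vectors(id int, subset_id int, vec float8[]) accessed with psycopg2 (adjust the names to your real schema): the database only stores and filters rows, and the cosine math stays in the Python layer.

    import numpy as np
    import psycopg2

    # Hypothetical connection string and schema; replace with your own.
    conn = psycopg2.connect("dbname=features")

    def closest_in_subset(query_vec, subset_id, k=10):
        with conn.cursor() as cur:
            cur.execute("SELECT id, vec FROM feature_vectors WHERE subset_id = %s",
                        (subset_id,))
            rows = cur.fetchall()
        ids = np.array([r[0] for r in rows])
        mat = np.array([r[1] for r in rows], dtype=float)      # (n, ~1000)
        q = np.asarray(query_vec, dtype=float)
        sims = (mat @ q) / (np.linalg.norm(mat, axis=1) * np.linalg.norm(q))
        top = np.argsort(-sims)[:k]                            # k closest entries
        return list(zip(ids[top].tolist(), sims[top].tolist()))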





  • Well, I would do that, but the comparison will be between a subset of feature vectors in the DB and an unseen feature vector. So I can do some math ahead of time, but the final distance will need to be computed for every request. I hope that makes sense; let me know if you need more clarity. – G4bri3l, Feb 13 at 19:34

  • @G4bri3l OK, so maybe the second option is out, but the first option is still valid. You really should try to avoid having logic in SQL; the database should be your repository, not a computational engine. – I_Play_With_Data, Feb 13 at 19:35

  • That makes sense, absolutely. As you can guess, I'm a bit new to this kind of problem, so how should I go about it? I feel like I'm trying to find a solution within the realm of what I know; I'm just not very aware of the possible approaches for this type of problem. – G4bri3l, Feb 13 at 19:47

  • @G4bri3l When you say that you receive a "request", how exactly is that coming in to you? Is it a physical request or some sort of automated system? What you want to do is introduce a new layer that sits between your request system and your database. Let's assume it's a Python script. The inputs to the script would be whatever is provided by the request. Then let Python do the database querying for whatever else it needs, and have a function that does the calculation you want. The Python output is your final result, which gets sent back to the request, and the process moves on. – I_Play_With_Data, Feb 13 at 19:51

  • Trying to keep it simple: a request is sent to an API, the API sends a request to another service and gets back a feature vector. Now this feature vector needs to be compared to a subset of feature vectors in the DB so that I can return the closest match. If I'm understanding this right, I might as well fetch the subset of feature vectors I need and then just let the server do the comparison in its own memory. So the API is the one doing the comparison, and I don't do it in the data layer. – G4bri3l, Feb 13 at 20:04



















Answer (score 0) – answered Feb 13 at 19:28 by Tasos

If you are afraid that the dataset is so big that a regular database might not handle it, you could consider an alternative approach such as SimHash.

From Wikipedia:

In computer science, SimHash is a technique for quickly estimating how similar two sets are. The algorithm is used by the Google Crawler to find near-duplicate pages. It was created by Moses Charikar.

Here is the research paper from Google, and here are several implementations in Python.
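Not one of the linked implementations, just a minimal sketch of the random-hyperplane (Charikar) variant of SimHash applied to cosine similarity, to show the idea; the bit width and the re-ranking strategy are arbitrary choices:

    import numpy as np

    rng = np.random.default_rng(0)
    NBITS, DIM = 64, 1000
    planes = rng.standard_normal((NBITS, DIM))    # fixed random hyperplanes

    def simhash(vec):
        """One sign bit per hyperplane: which side of it the vector falls on."""
        return (planes @ vec) >= 0                # boolean array of length NBITS

    def approx_cosine(sig_a, sig_b):
        """Charikar's estimate: P(bits differ) = angle / pi."""
        hamming = np.count_nonzero(sig_a != sig_b)
        return np.cos(np.pi * hamming / NBITS)

With the signatures of the stored vectors precomputed, ranking candidates by Hamming distance is cheap, and you only need exact cosine distances for the best few.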






  • Thanks, I'll look into it and see if it fits my case. – G4bri3l, Feb 13 at 19:32









