What is the ideal database for fast cosine-distance queries?
I'm currently trying to store many feature vectors in a database so that, upon request, I can compare an incoming feature vector against many (if not all) of those stored in the DB. I need to compute the cosine distance and return only, say, the 10 closest matches. Each vector will have roughly 1,000 dimensions.
Every request carries a feature vector and must be compared against all feature vectors belonging to a subset within the DB (most likely on the order of thousands of entries per subset in the worst case).
Which database offers the flexibility to run such a query efficiently?
I looked into Postgres, but I was wondering whether there are alternatives that better fit this problem. Not sure it matters much, but I will most likely be using Python.
I found this article about doing it in SQL.
EDIT: I am open to alternative solutions to this problem that are not necessarily tied to SQL.
Tags: feature-extraction, databases
How large is your dataset? Is it possible to precompute the cosine similarity between every pair of vectors and store that instead (or in addition)?
– Wes, Feb 13 at 19:23
I will edit the question; the database won't be too big (thousands of entries). The problem is that I don't care about the similarity of the stored vectors to one another: each request comes with an unseen feature vector, and I have to compute its similarity against a subset of feature vectors already in the DB and return the closest ones. Not sure I explained myself as well as I think, so let me know if it's still not clear.
– G4bri3l, Feb 13 at 19:27
3 Answers
If it's only a few thousand entries, each with ~1,000 features, you may be able to keep everything in RAM if you are running this on some kind of server. Then, when you get a new feature vector, just run a cosine-similarity routine against the stored vectors. An easy way to do this is to use something standard like pandas and scikit-learn.
Alternatively, you can keep everything in SQL, load it into something like pandas, and use scikit-learn from there.
I'm actually not sure you'd get much of a speed-up, if any, by writing the computation in SQL itself.
– Wes, answered Feb 13 at 19:47
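A minimal sketch of this in-memory approach, assuming NumPy and scikit-learn are available; `stored_vectors` and the random data are placeholders, not anything from the question's schema:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Stored feature vectors, e.g. loaded once at server startup.
# Shape: (n_entries, n_features), here roughly (thousands, 1000).
stored_vectors = np.random.rand(5000, 1000)

def top_k_matches(query_vector, k=10):
    """Return indices and similarities of the k closest stored vectors."""
    # cosine_similarity expects 2-D inputs, so reshape the single query.
    sims = cosine_similarity(query_vector.reshape(1, -1), stored_vectors)[0]
    # argpartition finds the k largest without a full sort; then order them.
    top_idx = np.argpartition(-sims, k)[:k]
    top_idx = top_idx[np.argsort(-sims[top_idx])]
    return top_idx, sims[top_idx]

query = np.random.rand(1000)
indices, scores = top_k_matches(query, k=10)
```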
Nice, thank you. I thought for some reason there existed a DB implementation optimized for this kind of task. Now, out of curiosity, what if the number of entries grows by an order of magnitude?
– G4bri3l, Feb 13 at 20:06
I think you just have to try a couple of options and benchmark for your use case. If you have double-precision numbers and 10,000 vectors with 1,000 features each, that's about 80 MB, plus overhead for whatever data structure you use (like a pandas DataFrame). That should be doable to keep in RAM on most systems. But that use case assumes a server where you load the data once at startup.
– Wes, Feb 13 at 20:46
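A quick back-of-the-envelope check of that figure:

```python
# 10,000 vectors x 1,000 features x 8 bytes per float64
n_vectors, n_features, bytes_per_float64 = 10_000, 1_000, 8
print(f"{n_vectors * n_features * bytes_per_float64 / 1e6:.0f} MB")  # -> 80 MB
```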
Oh nice, loading it once at startup sounds like the way to go; that also lets different requests share the same allocated memory. So I load the vectors at startup, they can be accessed from any request, and at that point it's more computation than anything else. I think I have a pretty good idea of how to limit memory usage now. Thanks for the inspiration!
– G4bri3l, Feb 13 at 21:04
Please be sure to mark this as the accepted answer if it solved your problem.
– Wes, Feb 19 at 17:03
- Is there a reason why you need to do this in SQL? Most architecture patterns advise against keeping formulas and logic in the database layer. Why not create another layer, outside the database, in a language that can do the computations you need? (A sketch of this option follows below.)
- You can also do the calculations ahead of time and store them in a cached lookup table in the database: do all the computations you need, import them into your database, and then just run standard SQL SELECT statements to pull the results at run time.
– I_Play_With_Data, answered Feb 13 at 19:30
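A minimal sketch of that first option, assuming purely for illustration that the vectors live in a Postgres table `vectors(id integer, subset_id integer, vec float8[])` and that psycopg2 is installed; the table and column names are placeholders, not anything from the question:

```python
import numpy as np
import psycopg2

def closest_matches(query_vector, subset_id, k=10, dsn="dbname=features user=app"):
    """Fetch one subset of stored vectors from Postgres and rank them in Python."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("SELECT id, vec FROM vectors WHERE subset_id = %s", (subset_id,))
        rows = cur.fetchall()  # psycopg2 returns float8[] columns as Python lists

    ids = np.array([r[0] for r in rows])
    mat = np.array([r[1] for r in rows], dtype=float)  # shape (n_entries, n_features)

    # Cosine similarity computed in the application layer, not in SQL.
    q = np.asarray(query_vector, dtype=float)
    sims = (mat @ q) / (np.linalg.norm(mat, axis=1) * np.linalg.norm(q))

    order = np.argsort(-sims)[:k]
    return list(zip(ids[order], sims[order]))
```

The database stays a plain repository; all the math happens in the service that sits between the request and the DB.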
Well, I would do that, but the comparison is between a subset of feature vectors in the DB and an unseen feature vector. So I can do some math ahead of time, but the final distance still needs to be computed for every request. I hope that makes sense; let me know if you need more clarity.
– G4bri3l, Feb 13 at 19:34
@G4bri3l OK, so maybe the second option is out. But the first option is still valid. You really should try to avoid having logic in SQL; the database should be your repository, not a computational engine.
– I_Play_With_Data, Feb 13 at 19:35
That makes sense, absolutely. As you can guess, I'm a bit new to this kind of problem, so how should I go about it? I feel like I'm trying to find a solution within the realm of what I know; I'm just not very aware of the possible approaches to this type of problem.
– G4bri3l, Feb 13 at 19:47
@G4bri3l When you say that you receive a "request", how exactly is that coming in to you? Is it a physical request, or some sort of automated system? What you want to do is introduce a new layer that sits between your request system and your database. Let's assume it's a Python script: the inputs to the script are whatever the request provides; then let Python query the database for the remaining inputs and call a function that does the calculation you need. The Python output is your final result, which gets sent back to the requester, and the process moves on.
– I_Play_With_Data, Feb 13 at 19:51
Trying to keep it simple: a request is sent to an API, the API calls another service and gets back a feature vector. This feature vector then needs to be compared to a subset of feature vectors in the DB so that I can return the closest matches. If I'm understanding this right, I might as well fetch the subset of feature vectors I need and let the server do the comparison in its own memory. So the API does the comparison, and I don't do it in the data layer.
– G4bri3l, Feb 13 at 20:04
If you are afraid that the dataset is so big that a regular database might not handle it, you could consider an alternative approach such as SimHash.
From Wikipedia:
"In computer science, SimHash is a technique for quickly estimating how similar two sets are. The algorithm is used by the Google Crawler to find near-duplicate pages. It was created by Moses Charikar."
Here is the research paper from Google, and here are several implementations in Python.
– Tasos, answered Feb 13 at 19:28
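As a rough illustration of the idea behind this for real-valued vectors (sign random projections, which is the variant of Charikar's scheme used for cosine similarity), here is a toy sketch; the function names and random data are made up for the example and are not taken from any of the linked implementations:

```python
import numpy as np

def simhash_fingerprints(vectors, n_bits=64, seed=0):
    """Random-hyperplane SimHash: one bit per hyperplane, the sign of the projection."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((vectors.shape[1], n_bits))
    return (vectors @ planes) > 0  # boolean array, shape (n_vectors, n_bits)

def hamming_distances(query_bits, stored_bits):
    """Count differing bits between the query fingerprint and each stored fingerprint."""
    return np.count_nonzero(stored_bits != query_bits, axis=1)

# Toy data: 5,000 stored vectors with 1,000 features each, plus one query.
stored = np.random.rand(5000, 1000)
query = np.random.rand(1000)

stored_fp = simhash_fingerprints(stored)
query_fp = simhash_fingerprints(query.reshape(1, -1))[0]  # same seed -> same planes

# Smaller Hamming distance ~ smaller angle ~ higher cosine similarity.
candidates = np.argsort(hamming_distances(query_fp, stored_fp))[:10]
```

The fingerprints are cheap to store and compare, so they can serve as a first-pass filter before computing exact cosine distances on the shortlisted candidates.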
Thanks, I'll look into it and see if it fits my case.
– G4bri3l, Feb 13 at 19:32