Numpy cosine similarity

5/7/2023

In a previous article, GNPS Data with Python and Pandas, we downloaded the GNPS dataset as a .json file and showed you how to clean and analyze the data using Python and Pandas. We'll demonstrate an alternative method in this article. Instead of Pandas, we'll use the specialized mass spectrometry libraries matchms (for Cosine distance) and Omigami (for Spec2Vec and MS2DeepScore) to find similar spectra. This will get you started with finding clusters of similar spectra, and you'll become more familiar with data formats and tools built specifically for metabolomics data analysis.

To follow along, you should have Python and Jupyter Notebook installed. You should also be comfortable using pip or conda to install third-party Python packages. You'll need a computer with at least 8GB of RAM to comfortably load the entire GNPS dataset into memory.

We'll work through the following steps:

- Look at the MGF file and see how it compares to JSON.
- Install matchms and use it to read the MGF file.
- Use matchms and Cosine to find similar spectra, with an example spectrum.
- Use Omigami to find similar spectra, using the Spec2Vec and MS2DeepScore algorithms.

As in the previous tutorial, we'll use wget to download the data so that we can view progress, as the download is quite large. If you don't have wget, install it first.

Now that you're more familiar with matchms and MGF files, let's walk through using matchms to find clusters of similar spectra. Stenothricin spectra form an interesting gene cluster. Matchms includes built-in functionality to find the similarity between given spectra by looking at the cosine distance between the peak data. You can use Stenothricin C, ID CCMSLIB00000075068, as a starting point. To pull it out, you can loop through the specs variable until you find it.

In the for loop, we reverse the list, as the more similar spectra are towards the end. We print out the similarity between our "query" spectrum (Stenothricin C) and each of the top ten matches.

Stenothricin C M+H has a 1.0 cosine match with Stenothricin C M+H (CCMSLIB00000075068)
Stenothricin C M+H has a 0.76 cosine match with 14-hydroxysprengerinin C + (CCMSLIB00006456969)
Stenothricin C M+H has a 0.64 cosine match with Stenothricin G M+H (CCMSLIB00000075069)
Stenothricin C M+H has a 0.54 cosine match with 14-hydroxysprengerinin C + (CCMSLIB00006456977)
Stenothricin C M+H has a 0.47 cosine match with 14-hydroxysprengerinin C + (CCMSLIB00006456985)
Stenothricin C M+H has a 0.47 cosine match with 20(S)-Ginsenoside F2 + (CCMSLIB00006580075)
Stenothricin C M+H has a 0.46 cosine match with 14-hydroxysprengerinin C + (CCMSLIB00006456712)
Stenothricin C M+H has a 0.44 cosine match with 14-hydroxysprengerinin C + (CCMSLIB00006455914)
Stenothricin C M+H has a 0.43 cosine match with 14-hydroxysprengerinin C + (CCMSLIB00006456917)
Stenothricin C M+H has a 0.42 cosine match with 14-hydroxysprengerinin C + (CCMSLIB00006456701)

Because we used an example from the dataset as a query spectrum, we get one perfect match (with itself) and then some other close matches with Hydroxysprengerinin C and Stenothricin G.

If we happen to know the spectrum IDs of some other spectra in the Stenothricin cluster, we can find the similarity to each:

Stenothricin C M+H has a 0.03 cosine match with Stenothricin A M+H (CCMSLIB00000075070)
Stenothricin C M+H has a 0.1 cosine match with Stenothricin D M+H (CCMSLIB00000075071)
Stenothricin C M+H has a 0.04 cosine match with Stenothricin B M+H (CCMSLIB00000075072)
Stenothricin C M+H has a 0.05 cosine match with Stenothricin E M+H (CCMSLIB00000075073)
Stenothricin C M+H has a 0.04 cosine match with Stenothricin H M+H (CCMSLIB00000075076)
Stenothricin C M+H has a 0.02 cosine match with Stenothricin I M+H (CCMSLIB00000075077)

These other spectra are part of the Stenothricin cluster. But interestingly, most of their cosine similarity scores are low.

Using Omigami to compute Spec2Vec similarity scores

"Similarity" is often thought of as a fixed metric. In reality, there are many different ways for things to be similar to, or different from, each other. Cosine distance is one way to measure how similar spectra are to one another, but there are many other ways to do this. Spec2Vec, which we've written about in detail, is often a better measure than Cosine distance.

Head over to Omigami and sign up for an account.
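
To make the cosine matching used in this article more concrete, here is a minimal sketch in plain Python of how a cosine score between two spectra can be computed from their peak data. It is an illustration only: the function name `cosine_score`, the `tolerance` value, and the toy peak lists are assumptions made for this example, not part of matchms or the GNPS dataset, and matchms's own implementation may differ in its details.

```python
import math


def cosine_score(peaks_a, peaks_b, tolerance=0.1):
    """Simplified cosine score between two spectra (illustrative sketch).

    Each spectrum is a list of (mz, intensity) pairs. Peaks are paired
    greedily, largest intensity product first, when their m/z values
    differ by at most `tolerance` (a hypothetical default).
    """
    # Collect every candidate pair of peaks whose m/z values are close enough.
    candidates = []
    for i, (mz_a, int_a) in enumerate(peaks_a):
        for j, (mz_b, int_b) in enumerate(peaks_b):
            if abs(mz_a - mz_b) <= tolerance:
                candidates.append((int_a * int_b, i, j))

    # Greedily accept the strongest pairs, using each peak at most once.
    candidates.sort(reverse=True)
    used_a, used_b = set(), set()
    matched_product = 0.0
    for product, i, j in candidates:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            matched_product += product

    # Normalize by the intensity vector lengths of both spectra.
    norm_a = math.sqrt(sum(intensity ** 2 for _, intensity in peaks_a))
    norm_b = math.sqrt(sum(intensity ** 2 for _, intensity in peaks_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return matched_product / (norm_a * norm_b)


# Toy spectra, invented for this example.
query = [(100.0, 0.7), (150.0, 0.2), (200.0, 1.0)]
shifted = [(100.05, 0.7), (150.02, 0.2), (200.01, 1.0)]
unrelated = [(120.0, 1.0), (310.0, 0.5)]

self_score = cosine_score(query, query)
shifted_score = cosine_score(query, shifted)
unrelated_score = cosine_score(query, unrelated)
print(round(self_score, 2), round(shifted_score, 2), round(unrelated_score, 2))
```

A spectrum compared with itself scores 1, which is why the query spectrum appears as a perfect match in its own results. Spectra that share no peaks within the tolerance score 0, which helps explain how chemically related compounds can still receive low cosine scores when their peaks are shifted.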