Add Contrast Profile Tutorial #800

Open
wants to merge 23 commits into
base: main

Conversation

ken-maeda
Contributor

Pull Request Checklist

Addresses #465: reproducing the paper as a tutorial.

Below is a simple checklist but please do not hesitate to ask for assistance!

  • Fork, clone, and checkout the newest version of the code
  • Create a new branch
  • Make necessary code changes
  • Install black (i.e., python -m pip install black or conda install -c conda-forge black)
  • Install flake8 (i.e., python -m pip install flake8 or conda install -c conda-forge flake8)
  • Install pytest-cov (i.e., python -m pip install pytest-cov or conda install -c conda-forge pytest-cov)
  • Run black . in the root stumpy directory
  • Run flake8 . in the root stumpy directory
  • Run ./setup.sh && ./test.sh in the root stumpy directory
  • Reference a GitHub issue (and create one if one doesn't already exist)

@review-notebook-app
Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

@codecov-commenter

codecov-commenter commented Mar 2, 2023

Codecov Report

Patch coverage has no change and project coverage change: -0.13 ⚠️

Comparison is base (a4bb1e1) 99.25% compared to head (5407968) 99.12%.


Additional details and impacted files
@@            Coverage Diff             @@
##             main     #800      +/-   ##
==========================================
- Coverage   99.25%   99.12%   -0.13%     
==========================================
  Files          82       83       +1     
  Lines       13101    13898     +797     
==========================================
+ Hits        13003    13776     +773     
- Misses         98      122      +24     

see 50 files with indirect coverage changes


@seanlaw
Contributor

seanlaw commented Mar 3, 2023

@ken-maeda Thank you for this contribution. Please allow me some time to review

@review-notebook-app
review-notebook-app bot commented Mar 7, 2023

seanlaw commented on 2023-03-07T11:21:35Z
----------------------------------------------------------------

T(+) requires at least two behaviors.

Can you explain why it requires at least two behaviors?

Maybe "behaviors" isn't the right word and you mean "at least two instances of the positive case"?


@review-notebook-app
review-notebook-app bot commented Mar 7, 2023

seanlaw commented on 2023-03-07T11:21:36Z
----------------------------------------------------------------

Line #1.    ecg_df = pd.read_csv("14172m.csv", index_col=0)

Instead of repeating astype(float) so many times later, you should just do:

ecg_df = pd.read_csv("14172m.csv", index_col=0, usecols=[1]).astype(float) 

Also, it would be nice just to show or print out what is in ecg_df.head() . What does the dataframe look like?



@review-notebook-app
review-notebook-app bot commented Mar 7, 2023

seanlaw commented on 2023-03-07T11:21:37Z
----------------------------------------------------------------

Can we add some comments on what we are looking at? Why does the bottom look so much more regular and with a repeated pattern?


@review-notebook-app
review-notebook-app bot commented Mar 7, 2023

seanlaw commented on 2023-03-07T11:21:39Z
----------------------------------------------------------------

Line #1.    v_query = ecg_df.iloc[5930:5930+127, 1].values.astype(float)

Why is the window size 127 and not 128?


@review-notebook-app
review-notebook-app bot commented Mar 7, 2023

seanlaw commented on 2023-03-07T11:21:40Z
----------------------------------------------------------------

I don't understand the significance of this. Your point isn't clear. Where did v_query come from? Why should the reader care about that?

It would be useful to discuss the bottom plot (distance profile) and how to interpret it.

Why is it useful/important to show distance profile?


ken-maeda commented on 2023-03-07T12:08:54Z
----------------------------------------------------------------

v_query is a typical normal ECG query that can be seen everywhere in the dataset, so one instance was picked at random.

The purpose of this distance profile is to find the desired behavior simply by comparing v_query (a typical ECG signal) against the series that contains the desired behavior (a rare signal). The assumption was that the desired behavior would have the highest value in the distance profile, but that did not happen.
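
To make the idea of a distance profile concrete, here is a minimal sketch. It assumes a z-normalized Euclidean distance computed naively with NumPy on a synthetic sine series; the helper names `znorm` and `distance_profile` are hypothetical and this is not the notebook's code. In practice one would likely use STUMPY's own distance-profile routine instead.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def znorm(a):
    # z-normalize along the last axis (works for one query or a stack of windows)
    return (a - a.mean(axis=-1, keepdims=True)) / a.std(axis=-1, keepdims=True)


def distance_profile(Q, T):
    # Distance between the query Q and every length-len(Q) subsequence of T
    windows = znorm(sliding_window_view(T, len(Q)))
    return np.linalg.norm(windows - znorm(Q), axis=1)


# Synthetic stand-in for an ECG-like series (assumption, not the real data)
rng = np.random.default_rng(0)
T = np.sin(2 * np.pi * np.arange(300) / 30) + 0.05 * rng.standard_normal(300)
Q = T[30:60].copy()  # a "typical" query taken from the series itself

D = distance_profile(Q, T)
print(D.shape, D[30])  # D[30] compares Q with itself
```

Low values mark subsequences that resemble the query, and high values mark subsequences that do not. A rare pattern is not guaranteed to produce the single highest value, which matches the observation above.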

@review-notebook-app
review-notebook-app bot commented Mar 7, 2023

seanlaw commented on 2023-03-07T11:21:41Z
----------------------------------------------------------------

What is "plato"?


ken-maeda commented on 2023-03-07T12:10:04Z
----------------------------------------------------------------

I added more description to the Contrast Profile section:

The subsequence in 𝐓(+) corresponding to the highest point in the Contrast Profile is called the Plato.


@ken-maeda
Contributor Author

I appreciate your feedback. I have fixed those.

@NimaSarajpoor
Collaborator

NimaSarajpoor commented Mar 8, 2023

@ken-maeda
I have a suggestion for you.

I think it is better to develop the notebook section by section. So, for each section that you add, you can wait to get some feedback and then apply those, and then after getting the green light, you can move forward. Right now, you may know what is going on in your notebook, however, the main goal is to make sure the reader can understand what is going on! You can keep the current notebook somewhere in your local pc. Then, you can start again by just providing the first section or a couple of sections. So, your notebook should only contain a couple of parts in the beginning. Then, you can add sections to it step by step.

Currently, if I do not understand a part of your notebook, I try to read other parts to better understand the concept. However, this is not desirable. I think this is a red flag. There should be a flow in your tutorial and I believe each segment should be understandable on its own.

Also, the text is as important as the code. In fact, I think it is more important particularly in tutorials. So, try to be extra careful when you explain a concept. You want to be crystal clear in every single step as much as possible.

@NimaSarajpoor
Collaborator

NimaSarajpoor commented Mar 8, 2023

Regarding contrast profile, this is how I see it:

We can see each subsequence of length m as a data point in $R^{m}$ space. For the sake of visualization, let's illustrate the problem in 2D space.

First, let's review the definitions of T(+) and T(-).

T(+) : contains at least two instances that are unique to the phenomena of interest.
T(-) : contains no instances of interest (and instead, I think we should say it contains the regular, obvious patterns in `T(+)`)

Before we talk about T(-) and T(+), it is better to just talk about T. Let's assume the figure below shows the subsequences of T.

image

If I look at this data, I can see that the regular, obvious pattern is where the crowded part is. However, the motif we might be interested in can be the motif pair (A, B). Note that this motif pair may not be easily captured as their distance is greater than the distance of any other point and its nearest neighbour.

So, what can we do? We can create T(-) which just contains the regular behaviour of our Data. We then use T(+) to denote the remaining ones.

image

Now, we can see that the d = dist(A, A_nn_in_Tneg) - dist(A, A_nn_in_Tpos) has a high value. Let's call it contrast distance. The contrast profile, cp, is an array where cp[i] is the contrast distance that corresponds to the i-th subsequence in T(+). The peak of this contrast profile can reveal the motif pair (A, B).
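
The contrast distance described above can be sketched in the same 2D setting. This is only an illustration under stated assumptions: the points, the pair (A, B), and the helper `nn_dist` are all hypothetical, and the plain difference of nearest-neighbour distances follows the description in this comment rather than any particular library implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2D "subsequences": crowded regular behaviour plus a far-away pair (A, B)
regular_pos = rng.normal(0.0, 0.1, size=(50, 2))
regular_neg = rng.normal(0.0, 0.1, size=(50, 2))
A, B = np.array([3.0, 3.0]), np.array([3.5, 3.0])

T_pos = np.vstack([regular_pos, A, B])  # contains the pair of interest (rows 50 and 51)
T_neg = regular_neg                     # regular behaviour only


def nn_dist(X, Y, exclude_self=False):
    # Distance from each row of X to its nearest neighbour among the rows of Y
    D = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
    if exclude_self:
        np.fill_diagonal(D, np.inf)  # a point is not its own neighbour
    return D.min(axis=1)


# cp[i] = dist(i, nearest neighbour in T(-)) - dist(i, nearest neighbour in T(+))
cp = nn_dist(T_pos, T_neg) - nn_dist(T_pos, T_pos, exclude_self=True)
peak = int(np.argmax(cp))
print(peak)  # row index of the highest contrast distance
```

For the regular points both nearest-neighbour distances are small, so their contrast distance stays near zero, while A and B are close to each other but far from everything in T(-).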


Question: Can we see it as twin-freak problem? In other words, this might be an anomaly that appears more than once. So, we can easily find the motif pair (A, B) by finding the subsequence that has the greatest distance to its 2nd nearest neighbour.

Answer: I do not know! I think a good way to investigate this is to get some data and find a pair of subsets using each of these two approaches: (1) twin-freak (2) contrast profile, and see if they result in different outcomes.
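
As a starting point for that comparison, the twin-freak idea (greatest distance to the 2nd nearest neighbour) might be sketched as follows. The 2D points are again a hypothetical stand-in for subsequences, and whether this agrees with the contrast profile on real data remains the open question above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 100 regular points plus a far-away pair at rows 100 and 101
points = np.vstack([rng.normal(0.0, 0.1, size=(100, 2)), [[3.0, 3.0], [3.5, 3.0]]])

# Pairwise distances, ignoring each point's distance to itself
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
np.fill_diagonal(D, np.inf)

# Distance to the 2nd nearest neighbour of every point
second_nn = np.sort(D, axis=1)[:, 1]
twin_freak_idx = int(np.argmax(second_nn))
print(twin_freak_idx)
```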

@ken-maeda
Contributor Author

@NimaSarajpoor
I'm sorry for causing trouble, and I'm grateful for your kind guidance. I uploaded a new notebook, Tutorial_Contrast_Profile2.ipynb, containing only the first section. I should have considered the contrast profile concept more.

I added a scatter plot to the notebook, which should be the best way to understand the "contrast" concept.

T(+) : contains at least two instances that are unique to the phenomena of interest.
T(-) : contains no instances of interest (and instead, I think we should say it contains the regular, obvious patterns in T(+))

This might be a tricky precondition, and the robustness to this precondition is also discussed. I thought the notebook should be enough to explain only the contrast profile concept.

@@ -0,0 +1,249 @@
{
Collaborator

@NimaSarajpoor NimaSarajpoor Mar 9, 2023

Small values in the Matrix Profile are called motifs, and large values are called discords.

I can see that this sentence is from the introduction part of the paper; however, I think it is not meaningful to call small value in matrix profile, a motif. Because a motif is a subsequence (NOT a value) whose distance to its nearest neighbour is small.

I think the authors provided a better definition in the abstract of the paper:

"Time series motifs refer to two particularly close subsequences, whereas time series discords indicate subsequences that are far from their nearest neighbors."

It may be usefull to score subsequences with

I think the paper is clearer here. In the paper, it says: "....score subsequences with a meta-data that reflects that ..."

Also, I think it would be a good idea to introduce T(+) and T(-) here.

This is exactly the property we desire.

Why? I mean how can we get benefit from it? According to my understanding and what I read in the paper, this property can be used to find a subsequence that can uniquely identify a class. In other words, this can be used in classification. Is that correct? I think it would be nice to provide an example to explain the significance of contrast profile. I know there is an example in paper (see Fig. 1), but I think it is a little bit complicated. So, let's try and see if we can come up with a good example to show the importance of contrast profile. Think about it.



@@ -0,0 +1,249 @@
{
Collaborator

@NimaSarajpoor NimaSarajpoor Mar 9, 2023

Let's walk through by running example of a noisy electrocardiogram(ECG).

Maybe remove this line.

We proposed to compute the Contrast Profile only when we belive that the two following assumptions are likely to be true:

Before we start talking about the contrast profile, I think it would be a nice idea to show the data and talk about the problem. What is the problem? What are we trying to find here?


Contributor Author

Before we start talking about the contrast profile, I think it would be a nice idea to show the data and talk about the problem. What is the problem? What are we trying to find here?

Added a simple example at the beginning.

@@ -0,0 +1,249 @@
{
Collaborator

@NimaSarajpoor NimaSarajpoor Mar 9, 2023

Do we use column 0 later? If not, then it might be a good idea to just read the csv for that column only. So, we can do:

T = pd.read_csv(..., usecols=["1"]).to_numpy(np.float64)
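
A runnable sketch of that suggestion follows. It assumes a CSV whose data column is literally named "1"; the in-memory text below is a hypothetical stand-in for the real ECG file. Selecting the single column before converting yields a 1-D array rather than an (n, 1) array.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for the ECG CSV: an unnamed index column plus columns "0" and "1"
csv_text = ",0,1\n0,10,0.5\n1,11,0.7\n2,12,0.6\n"

# Read only column "1", then flatten it into a 1-D float64 array
T = pd.read_csv(io.StringIO(csv_text), usecols=["1"])["1"].to_numpy(dtype=np.float64)
print(T.shape, T.dtype)
```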


Contributor Author

I modified the dataset itself; it was originally an extracted dataset.

@@ -0,0 +1,249 @@
{
Collaborator

@NimaSarajpoor NimaSarajpoor Mar 9, 2023

Is there a particular reason behind using the name v here? Can we use T instead? Like... T_pos and T_neg ?

ALSO: How do we know the indices 63630, 68129, etc? This may create confusion. Reading those indices made me think that using contrast profile requires us having some prior knowledge about those indices. If that is not the case, then can you provide a brief explanation on how one can create the datasets T(+) and T(-) in a real-world problem?

If you think that needs to have its own section, then you might explain that you are considering those indices to just show an example here.

Contributor Author

Is there a particular reason behind using the name v here? Can we use T instead? Like... T_pos and T_neg 

Fixed

@@ -0,0 +1,249 @@
{
Collaborator

@NimaSarajpoor NimaSarajpoor Mar 9, 2023

(1) Why are they desirable? Can we add some explanation as why one might be interested in finding those patterns? What do these patterns mean? In other words, what is the target class here?

(2) The code says label="desired instances". However, the figure's legend says "desired behaviour".

(3) Right after this block, it might be useful to compute matrix profile, and discover motif and discord, and show that the discovered motifs/discords do not reveal the patterns we are looking for.


Contributor Author

1) Added a description

2) Fixed

3) Added a matrix profile section

@@ -0,0 +1,249 @@
{
Collaborator

@NimaSarajpoor NimaSarajpoor Mar 9, 2023

T(+) looks noisy, 𝐓(+) has to include contains at least two instances

Please rewrite it as follows: T(+) contains at least two instances that are of our interest.


Contributor Author

Fixed

@NimaSarajpoor
Collaborator

NimaSarajpoor commented Mar 9, 2023

@ken-maeda

I'm sorry for causing trouble, and I'm grateful for your kind guidance.

No need to be sorry. I provided a few comments. Let's start with those. Please do not add any new section. Let's take care of the current sections first. Please feel free to discuss something if you feel there is a need for that.

@seanlaw
Contributor

seanlaw commented Mar 10, 2023

@ken-maeda Instead of uploading a .png file, I would prefer if you could add the code that could create/recreate the image and have it inside the notebook

@ken-maeda ken-maeda requested a review from NimaSarajpoor March 23, 2023 15:38
@@ -0,0 +1,511 @@
{
Collaborator

@NimaSarajpoor NimaSarajpoor Mar 24, 2023

... patterns(PVCs) ...

... patterns (PVCs) ... [Notice the space I added]

Also, what is PVC? Should the reader know about this? If not, then you can/should remove it. If yes, then you should avoid using an abbreviation unless you are going to use it again later. Even in such a case, you should first provide the original phrase.

Those are target we try to find.

Those are the targets we try to find.


Contributor Author

I removed the abbreviation.

docs/Tutorial_Contrast_Profile2.ipynb (resolved)
@@ -0,0 +1,511 @@
{
Collaborator

@NimaSarajpoor NimaSarajpoor Mar 24, 2023

Great job in providing a clear plot!



@@ -0,0 +1,511 @@
{
Collaborator

@NimaSarajpoor NimaSarajpoor Mar 24, 2023

We can find clealy the discord by Matrix profile. It indicate the discord in Matrix profile by far.

Maybe we should say: "In this case, the discord, indicated by matrix profile, is what we are looking for."

What do you think? Because, any matrix profile with a unique global maximum can reveal a discord. Whether or not that discord is what we are looking for is a different story. Here, we use matrix profile to find discord, and we are lucky that the discovered discord is what we are looking for.


Contributor Author

I agree with you

@@ -0,0 +1,511 @@
{
Collaborator

@NimaSarajpoor NimaSarajpoor Mar 24, 2023

Matrix profile indicates slided wrong place as discords.

As a follow-up to what I explained in my comment above, I think we should rephrase it. We may say: "The discord discovered by matrix profile is not what we are looking for."

I am just trying to avoid using the word "wrong" here. In my head, contrast matrix profile is another tool that can be used to reveal something. Whether that thing is what we are looking for or not depends on our perspective and our goal. In this case, we will probably see that contrast matrix profile can reveal something that is not captured by matrix profile. But, is that always useful? Well, that depends on the domain. In other words, a user does not know if the discord discovered by matrix profile is useful. The same goes for contrast matrix profile. A user can compute both and then investigate their outcome to see which reveals better information about the data.

Most important thing is Matrix profile also doesn't indicate remarkable distance for the discord.

How can one know if it is a remarkable distance for discord or not? I think this sentence is a little bit biased. This is because we already saw the matrix profile computed for the first half of T and we now feel the maximum distance shown in figure above is not large enough. So, I think we should remove this sentence as we prefer to not be subjective in our analysis.

=> Because similar anomalies are nearest neighbor relationship.

Maybe we should say this first:

"Because the targets we are looking for are similar to each other, they can be nearest neighbor of each other and hence their corresponding distance in matrix profile will be small. Therefore, none of them will be detected via matrix profile"

We find those similar anomalies from the dataset with those following specification.

Now, I think this is the place where we should do our best and provide a clear explanation. We can say: "To find the desirable anomalies, we should first understand what properties distinguish them from the other patterns. Note that these two anomalies are similar to each other but they are dissimilar to other (regular) patterns. This is the main concept behind the contrast matrix profile."

This is exactly the property we desire. In other words, we need to prepare two following data.

Then you can create a section here and name it "Contrast matrix profile". Then, we say:

"To compute the contrast matrix profile, two time series data are needed as follows: "

Take a look at the plot below to confirm the condictions

Are we showing any important information here? If not, then we can just remove this line and the code you have provided for plotting the two time series T_p and T_n


Contributor Author

I replaced the sentences with your suggested content.

@NimaSarajpoor
Collaborator

NimaSarajpoor commented Mar 24, 2023

@ken-maeda
I reviewed up to the section Loading the ECG data for Contrast Profile. Whenever you address a comment, you can go to ReviewNB (see the top of this PR) and then click on "Resolve Conversation" whenever you are done with that comment.


I think you have done a great job so far, and the notebook becomes more and more clear.

@ken-maeda
Contributor Author

@NimaSarajpoor I appreciate your kind feedback. I updated the notebook markdown.

@NimaSarajpoor
Collaborator

NimaSarajpoor commented Apr 12, 2023

@ken-maeda
From my point of view, things are good so far. My only comment is about the last figure, as I think it is a little bit crowded (and I am not sure if there is a better way to break it down or not). Also, you may want to revise the last sentence. I think the last sentence talks about the desirable patterns, but what you should have discussed is that the discords discovered via matrix profile (shown in "orange") are not the desirable ones, or something like that.

@ken-maeda
Contributor Author

@NimaSarajpoor I split the last plot into separate plots (maybe redundant?). As you mentioned, it was crowded and it was hard to recognize what was being indicated. I also fixed the last sentence.

@@ -0,0 +1,498 @@
{
Collaborator

@NimaSarajpoor NimaSarajpoor Apr 23, 2023

Please change

If a dataset has similar more than one discords, we should know how those discords are calculated in Matrix Profile.

to:

If a dataset has two similar subsequences that are far from the rest of subsequences, they may not be discovered as motif or discord by matrixprofile.



@@ -0,0 +1,498 @@
{
Collaborator

@NimaSarajpoor NimaSarajpoor Apr 23, 2023

Let's see the entire dataset first.



@@ -0,0 +1,498 @@
{
Collaborator

@NimaSarajpoor NimaSarajpoor Apr 23, 2023

Maybe remove the line Define index of desired instances on the purpose?



@@ -0,0 +1,498 @@
{
Collaborator

@NimaSarajpoor NimaSarajpoor Apr 23, 2023

Line #5.    desired0_idx, desired1_idx = 550, 2030

Please move this line to the top of the next code cell, where you show the desired instances. Also, please add one blank line after it (in the next code cell) for sake of readability.



@@ -0,0 +1,498 @@
{
Collaborator

@NimaSarajpoor NimaSarajpoor Apr 23, 2023

(1) Is there extra space between the first two sentences?

(2) According to..., note that I replaced the dot with a comma

(3) please modify the second sentence as follows:

.... of premature ventricular contractions, with start index of 580 and 2030 (see figure below).

(4) please modify the third sentence as follows:

Those are the two instances we would like to discover.


@@ -0,0 +1,498 @@
{
Collaborator

@NimaSarajpoor NimaSarajpoor Apr 23, 2023

please replace upper with top



@@ -0,0 +1,498 @@
{
Collaborator

@NimaSarajpoor NimaSarajpoor Apr 23, 2023

I think you should replace the window with a vertical line in the matrix profile figure. For example, see the section Find potential anomalies (discords) using stump in this page: https://stumpy.readthedocs.io/en/latest/Tutorial_STUMPY_Basics.html



@@ -0,0 +1,498 @@
{
Collaborator

@NimaSarajpoor NimaSarajpoor Apr 23, 2023

Please replace this sentence with:

"As shown in this figure, the motifs discovered by the matrix profile (of T_p) do not reveal our two desirable subsequences, the ones that were shown in the previous figure."

I think this can help readers understand why we need to use the contrast profile technique later to discover the two desirable subsequences.



@@ -0,0 +1,498 @@
{
Collaborator

@NimaSarajpoor NimaSarajpoor Apr 23, 2023

Cool. I can better understand this figure. I think what you are trying to say is that the discord discovered by matrix profile is not what we are looking for. I suggest doing the following instead:

1 - compute the discord index and its nearest neighbor using matrix profile.

idx = np.argmax(MP_PP[:, 0])  # start index of discord
nn_idx = MP_PP[idx, 1]  # start index of nearest neighbor of discord

2 - show two figures only, and both of them should be T(+). In one, you plot T(+) and show the discord and its nearest neighbor subsequences, according to the computed idx and nn_idx. In another figure, you plot T(+) again but you show the desirable subsequences instead.

Please choose a proper title for each figure. For the first figure, we may say: "the discord and its nearest neighbor discovered by the matrix profile", and for the second figure, we may say: "the desirable subsequences of our interest"

Now, instead of four figures, we only show two figures, and I think it should be easily understandable. Note that the figure of matrix profile itself (the last two figures shown above) does not help that much. The third figure only shows where the matrix profile is maximum, and the fourth figure might be confusing. So, I think using two lines of code is better than showing two more figures here.



@NimaSarajpoor
Collaborator

NimaSarajpoor commented Apr 23, 2023

@ken-maeda
I provided my final touch on the notebook. After addressing the comments, we should see what @seanlaw thinks about this notebook.

If everything is okay, we can then move forward and add the part where you compute the contrast profile and show that it can discover the subsequences of our interest.

@ken-maeda
How do you feel about the progress so far?

@ken-maeda
Contributor Author

@NimaSarajpoor
I'm sorry for the delay. I have fixed the notebook at the points you mentioned. I hope the notebook is fine now.

@NimaSarajpoor
Collaborator

NimaSarajpoor commented May 15, 2023

@ken-maeda
Thank you for addressing the comments. While there is still some room for improvement, we can do it later. I think @seanlaw can take a look at the notebook now and see if he has any opinion / suggestion.

@seanlaw
Do you have any comment on the second notebook, docs/Tutorial_Contrast_Profile2.ipynb ?

@seanlaw
Contributor

seanlaw commented May 15, 2023

Let me find some time to provide some comments

@@ -0,0 +1,485 @@
{
Contributor

@seanlaw seanlaw May 15, 2023

  1. There seems to be an extra space at the beginning of the title " Novel Time Series...". Please remove the space
  2. "matrix profile" are two separate words, not one. Please add a space
  3. "If a dataset has two similar subsequences that are far from the rest of subsequences": Can you please elaborate on this sentence? I am not getting your point. If they are similar enough then they may not be the top motif but it is possible to be, say, in the top 50 motifs?
  4. Usually, in the opening paragraph, we want to try to explain what the problem is that contrast profiles can help solve. Can you clearly explain the problem? It is okay to borrow from the published paper


@@ -0,0 +1,485 @@
{
Contributor

@seanlaw seanlaw May 15, 2023

Can you please place the ECG data into one of the comments of the PR and then link to that uploaded CSV?



@@ -0,0 +1,485 @@
{
Contributor

@seanlaw seanlaw May 15, 2023

I don't think it is correct to call the grey areas "anomalies". Are they ever referred to as anomalies in the paper?

Edit: Below, you use the term "phenomena" and I think this is reasonable and suitable instead of "anomaly".



@@ -0,0 +1,485 @@
{
Contributor

@seanlaw seanlaw May 15, 2023

Why do we set normalize=False? Did they do that in the original paper?

Why are we looking for the discord index first? It seems like the wrong order of steps. I feel like the first thing that somebody would do is to:

  1. Apply stump to the entire time series
  2. Then they realize that the "interesting" subsequence pair isn't in their top motif(s)
  3. Finally, we explain why a naive approach is insufficient to find the "interesting" subsequence pair

Then, we can motivate the question, "How would/could we find the "interesting" subsequence pair?" and then provide clear and concrete steps on how to do it. Otherwise, this section seems out of place or presented too soon.

I think we have to think about "what is the simplest thing that the user would have tried?" and then, when that fails, we build upon that knowledge and move past it with some better suggestion(s)



docs/Tutorial_Contrast_Profile2.ipynb
@@ -0,0 +1,485 @@
{
Copy link
Contributor

@seanlaw seanlaw May 15, 2023


Why are we only showing the first 2 rows? What is the point?



@ken-maeda
Copy link
Contributor Author

Apply stump to the entire time series
what is the simplest thing that the user would have tried?

Regarding whether it is natural to calculate the matrix profile for the entire series first: the characteristic we are trying to find here is neither a motif nor a discord, so I feel there is no motivation to apply stump directly. Even if you simply compute it for the entire series, I don't think the result itself tells us much, so I compared the case with a single discord against the case with multiple discords and built on what can be said from there.
It might be better to make what I'm trying to do easy to understand from the very beginning.

@seanlaw
Copy link
Contributor

seanlaw commented May 21, 2023

Regarding whether it is natural to calculate the matrix profile for the entire series first: the characteristic we are trying to find here is neither a motif nor a discord, so I feel there is no motivation to apply stump directly. Even if you simply compute it for the entire series, I don't think the result itself tells us much, so I compared the case with a single discord against the case with multiple discords and built on what can be said from there.
It might be better to make what I'm trying to do easy to understand from the very beginning.

I think the point is that stump will not be able to help you here precisely because the subsequence is neither a top motif nor a top discord. Certainly, if you traverse down through the top-N motifs then you might eventually find it. I think it is important to motivate "why" computing the full matrix profile is not enough and also demonstrate its ineffectiveness for this particular problem (i.e., when the subsequence of interest does not have a nearest neighbor that is as close as other motifs)

@ken-maeda
Copy link
Contributor Author

I think it is important to motivate "why" computing the full matrix profile is not enough and also demonstrate its ineffectiveness for this particular problem

  1. It is challenging to set a goal prior to calculating the Matrix Profile across the whole data.
    When we have a signal pattern we want to find, searching for motifs or discords with the Matrix Profile may not seem natural. Users might question what to do in such cases and may find it unnatural to take action using the naive Matrix Profile.

  2. Analyzing the result of applying the Matrix Profile to the whole data is difficult.
    As you mentioned, whether we can find what we're looking for largely depends on the parts of the signal other than the characteristic of interest. Therefore, it's hard to say from the results what would be better from the perspective of that characteristic.

  3. Which elements must be explained in the introduction, and which merely can be.
    I want to determine that.
    Currently, the overall flow is:
    3-1. If a discord is included once, it can be found.
    3-2. If a discord is included twice, it cannot be found.
    3-3. So, what should we do?
    Regarding this flow, I thought that it would be better to write more concisely at the beginning about what the Contrast Profile brings. However, what do you think should be written in the introduction?

@seanlaw
Copy link
Contributor

seanlaw commented May 23, 2023

3-1. If a discord is included once, it can be found.
3-2. If a discord is included twice, it cannot be found.
3-3. So, what should we do?
Regarding this flow, I thought that it would be better to write more concisely at the beginning about what the Contrast Profile brings. However, what do you think should be written in the introduction?

If you look at the comments, I don't think "discord" or "anomaly" is the right word here as a discord is referring to "a subsequence that has a one-nearest neighbor that is very far away". In our example, the one nearest neighbor may not necessarily be very far away and, instead, it is that the subsequence of interest isn't discovered in the first few motifs. The paper refers to them as "phenomena of interest" and not "discord"
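
To make the distinction concrete, here is a minimal brute-force sketch of a self-join matrix profile on a toy series. It is pure Python and is not STUMPY's implementation; the `matrix_profile` helper, the exclusion-zone size, the non-normalized distance, and the toy data are all illustrative assumptions. The top discord is the subsequence whose nearest neighbor is farthest away, and the top motif is the one whose nearest neighbor is closest.

```python
import math


def euclid(a, b):
    """Plain (non-normalized) Euclidean distance between two equal-length lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def matrix_profile(T, m, excl=1):
    """Brute-force self-join: distance from each subsequence to its nearest
    neighbor, skipping trivial matches within `excl` positions (assumed value)."""
    n = len(T) - m + 1
    mp = []
    for i in range(n):
        best = math.inf
        for j in range(n):
            if abs(i - j) > excl:
                best = min(best, euclid(T[i:i + m], T[j:j + m]))
        mp.append(best)
    return mp


# Toy series: a repeating pattern with one unusual shape inserted exactly once
pattern = [0.0, 1.0, 0.0, -1.0]
T = pattern * 6
T[8:12] = [0.0, 3.0, 3.0, 0.0]
m = 4

mp = matrix_profile(T, m)
discord_idx = max(range(len(mp)), key=mp.__getitem__)  # farthest nearest neighbor
motif_idx = min(range(len(mp)), key=mp.__getitem__)    # closest nearest neighbor
print(discord_idx, motif_idx)
```

In this toy series the inserted shape occurs only once, so a subsequence overlapping it surfaces as the top discord. Once the same shape occurs a second time, its nearest-neighbor distance drops and it stops being a discord, which is the situation described in this comment.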

When I look at the contrast profile paper, it presents the problem as:

  1. Imagine that you have a time series that usually has a reasonably well defined set of one or more repeating patterns and other subsequences are not expected
  2. Then, something occurs that induces a new subsequence (a phenomenon) that has never been observed before AND it isn't quite a discord because the phenomenon has repeated itself at least once more and, while this nearest neighbor might be "close" to the first manifestation, it isn't as "close" as the nearest neighbors of previously known motifs (i.e., the repeating patterns in 1.)

So, you have some historical data where everything is well known and then you encounter an event that causes a new, never seen before subsequence AND it happens a second time. And so it becomes obvious to ask the question, "given how rare the phenomena is, how might we go about detecting it in a systematic way?"

The first (naive) thing that you might try is to compute the matrix profile for the entire time series, iterate through the top-N motifs via stumpy.motifs, and then consider how similar the ith motif is to the first i-1 motifs. However, there is a "better" way... Then introduce the contrast profile (including its assumptions, what it aims to do, and what it isn't/can't do)
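
A rough sketch of the idea behind the contrast profile, assuming the paper's setup: a series T(+) that contains the phenomenon at least twice, a series T(-) that does not contain it, and a profile obtained by subtracting the self-join matrix profile of T(+) from the AB-join matrix profile of T(+) against T(-). The helper names, the toy data, the non-normalized distance, and the omission of any rescaling are assumptions made for illustration; this is neither STUMPY's API nor the paper's implementation.

```python
import math


def dist(a, b):
    """Plain (non-normalized) Euclidean distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def subsequences(T, m):
    return [T[i:i + m] for i in range(len(T) - m + 1)]


def self_join(T, m, excl):
    """Nearest-neighbor distance of each subsequence within the same series."""
    subs = subsequences(T, m)
    return [
        min(dist(a, b) for j, b in enumerate(subs) if abs(i - j) > excl)
        for i, a in enumerate(subs)
    ]


def ab_join(A, B, m):
    """Nearest-neighbor distance of each subsequence of A among those of B."""
    subs_b = subsequences(B, m)
    return [min(dist(a, b) for b in subs_b) for a in subsequences(A, m)]


def contrast_profile(T_pos, T_neg, m, excl=1):
    """High where a subsequence repeats inside T_pos but has no close match in T_neg."""
    mp_pp = self_join(T_pos, m, excl)
    mp_pn = ab_join(T_pos, T_neg, m)
    return [max(0.0, pn - pp) for pn, pp in zip(mp_pn, mp_pp)]


pattern = [0.0, 1.0, 0.0, -1.0]
event = [0.0, 3.0, 3.0, 0.0]  # hypothetical "phenomenon", present twice in T_pos
T_neg = pattern * 5
T_pos = pattern * 2 + event + pattern * 2 + event + pattern * 2

cp = contrast_profile(T_pos, T_neg, m=4)
best = max(range(len(cp)), key=cp.__getitem__)
print(best, cp[best])
```

The repeating background pattern scores zero because it matches equally well inside T(+) and in T(-), while the twice-occurring event scores high without having to guess its rank among the motifs.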

@ken-maeda
Copy link
Contributor Author

What I'm trying to say is that it doesn't make sense to me why we're calculating the matrix profile for the entire series when there's not enough reason to expect that the phenomena we're trying to discover will be detected as motifs. It feels like we're just saying, 'We tried calculating it and this is what happened.'

So, you have some historical data where everything is well known and then you encounter an event that causes a new, never seen before subsequence AND it happens a second time. And so it becomes obvious to ask the question, "given how rare the phenomena is, how might we go about detecting it in a systematic way?"

In this way, the phenomenon we wish to find this time is mentioned, but I believe no mention is made of the relationship with other intervals or motifs that occur in other intervals. Under these circumstances, would a user try to find it as a motif?

Particularly in this case, where other parts clearly have repetitive subsequence, it feels unnatural to calculate motifs to find the phenomena.


@seanlaw
Copy link
Contributor

seanlaw commented May 30, 2023

Particularly in this case, where other parts clearly have repetitive subsequence, it feels unnatural to calculate motifs to find the phenomena.

@ken-maeda If this is the case, you really need to explain this fact about the data at the beginning when you present the data. I don't think it is enough to assume that the reader will notice the repetition. Instead, you must point out the facts even if it is obvious. Then, you should explicitly explain that computing the full matrix profile won't really help you find the phenomena and why that is. Better yet, I was advocating that you just compute the matrix profile and show its limitations in that the phenomena won't be captured easily by the matrix profile (or using the motifs function).

@ken-maeda
Copy link
Contributor Author

As you suggested, I tried it. Does this seem to be okay?

[Images: top-motif locations found by motifs]

We were able to find something close to the motif we want to find as the top 10th motif.

@seanlaw
Copy link
Contributor

seanlaw commented Jun 7, 2023

As you suggested, I tried it. Does this seem to be okay? We were able to find something close to the motif we want to find as the top 10th motif.

Yes! What code did you use to get that information? Was it stumpy.stump followed by stumpy.motifs?

So, I think it's important to explicitly point out/motivate that, in this example, we only know to search to the 10th motif because we are lucky enough to see/visualize where the phenomena is but we have no idea if it will be the 10th motif or the 100th motif. In the real world, the phenomena might be much harder to spot with the naked eye because the repetition within data might be much longer/noisier. And so we need a different way to solve this problem.

Maybe a poor-man's version would be to compare the ith motif with all of the motifs that were discovered before it and ask how different/similar it is? But this is painful and more art than science.
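
The poor-man's comparison could be sketched roughly as follows. This is a hypothetical illustration only: the `novelty_scores` helper and the hand-written motif values are assumptions, and in practice the ranked motif subsequences would come from the motif search rather than being typed in.

```python
import math


def dist(a, b):
    """Plain (non-normalized) Euclidean distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def novelty_scores(motifs):
    """For each motif, in rank order, return the distance to the closest motif
    ranked before it. The first motif has nothing to compare against, so it
    gets 0.0 by convention (an arbitrary choice for this sketch)."""
    scores = []
    for i, candidate in enumerate(motifs):
        earlier = motifs[:i]
        scores.append(min((dist(candidate, e) for e in earlier), default=0.0))
    return scores


# Hypothetical ranked motifs: three copies of the repeated pattern, then a new shape
motifs = [
    [0.0, 1.0, 0.0, -1.0],
    [0.1, 1.0, 0.0, -1.0],
    [0.0, 0.9, 0.1, -1.0],
    [0.0, 3.0, 3.0, 0.0],
]

scores = novelty_scores(motifs)
most_novel = max(range(len(scores)), key=scores.__getitem__)
print(most_novel, scores)
```

Deciding how large a score has to be before a motif counts as "new" still requires a hand-picked threshold, which is the "more art than science" part.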

@ken-maeda
Copy link
Contributor Author

mp = stumpy.stump(T, m, normalize=False)
motifs = stumpy.motifs(T, mp[:, 0])

I simply calculated the above example from the motifs.
Considering the following flow for the introduction, how far do you think it's appropriate to explain?

1. "By naively using stump, you can discover motifs and discords.
2. In this case, the signals we want to find are similar to each other, so it should be possible to find them as one of the top motifs when we compute stump.
3. The signals we want to find are clearly characterized by their size (amplitude), so we compute stump with normalize set to False (the default in stump is True).
4. They were found as the 10th motif.
5. However, in reality, we don't know at which rank among the top motifs they will appear. There is a more reliable way to find signals with these characteristics: a concept called the Contrast Profile, which I would like to explain in detail."
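
The reasoning in point 3 can be illustrated with a small sketch: two subsequences with the same shape but different amplitude are far apart under a plain Euclidean distance and essentially identical once each is z-normalized. The `znorm` and `dist` helpers and the toy values are assumptions for illustration, not STUMPY code.

```python
import math
import statistics


def znorm(x):
    """Z-normalize a subsequence (zero mean, unit standard deviation)."""
    mu = statistics.fmean(x)
    sigma = statistics.pstdev(x)
    return [(v - mu) / sigma for v in x]


def dist(a, b):
    """Plain Euclidean distance."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


small = [0.0, 1.0, 0.0, -1.0]  # background pattern
large = [0.0, 5.0, 0.0, -5.0]  # same shape, five times the amplitude

raw = dist(small, large)                       # sensitive to amplitude
normalized = dist(znorm(small), znorm(large))  # amplitude removed

print(raw, normalized)
```

With normalization, an amplitude-only difference disappears from the distance, so a signal characterized mainly by its size cannot be separated from the background pattern that way.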

@seanlaw
Copy link
Contributor

seanlaw commented Jun 12, 2023

@ken-maeda I think this level of detail is more than fine. However, I think it is important to point out that there is repetition in the data and so finding the "phenomena" (only repeated once) is challenging since the majority of the top motifs will look like the repeated pattern.

@ken-maeda
Copy link
Contributor Author

@seanlaw OK, I'm going to create another notebook from the introduction, following your new advice.

@@ -0,0 +1,226 @@
{
Copy link
Contributor

@seanlaw seanlaw Jun 14, 2023


In all of our tutorials, we try to reproduce specific figures from the original paper (without alteration). Which figure from the paper is this reproducing? This looks different from Figure 4. Did you alter the time series? If yes, can you please use the same time series as provided in the original paper? The goal here is reproducibility of the work.

Also, the original paper does not say anything about setting normalize=False and so we probably shouldn't confuse the reader by doing it here either.



Copy link
Contributor Author

@ken-maeda ken-maeda Jun 14, 2023


The introduction I'm currently creating is a reproduction of Figure 2, not Figure 4. We have discussed up to this point that the original figure is abstract and hard to understand, so it's necessary to explain what it means with an actual signal. Therefore, I'm using parts of the original data that were not used in the paper but are suitable for this explanation. So, the normalization setting is not something taken from the paper. If that is a problem, I think it could have been pointed out before I created it. I checked the flow of the introduction above to prevent this kind of confusion... and this is inconsistent with the past discussion.

As for Figure 4 and the other examples, I have reproduced them according to the paper in the notebooks so far.

4 participants