Monitoring Range Motif on Streaming Time-Series, presented at DEXA 2018

Monitoring Range Motif on Streaming Time-Series
Shinya Kato, Daichi Amagata, Shunya Nishio, and Takahiro Hara
Osaka University, Japan

Table of contents
⚫Background
⚫Baseline algorithm
⚫Proposed algorithm SRMM
⚫Experiment
⚫Conclusion

Background (1/2)
⚫Recently, large amounts of time-series data have been collected.
- Examples: power consumption of home appliances, emissions of greenhouse gases, electrocardiograms
- Analyzing them enables anomaly detection, environment monitoring, and discovery of arrhythmia.

Background (2/2)
⚫Range motif
- One of the most important tools for analyzing time-series
- A subsequence that appears repeatedly in a time-series
[Figure: a range motif in a time-series]

Application example
⚫Event detection
- Assume that you save the motif every day.
- If the current motif is very different from the motifs saved 1, 2, and 3 days ago, it can be expected that there is an anomaly event.

Problem definition
⚫Range motif monitoring on a streaming time-series under a count-based sliding window setting
- When the window slides, a new value is inserted and the oldest value is removed.
- Only the most recent $w$ values are considered; older values are ignored.
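
The count-based window can be illustrated with a short Python sketch. The function name `sliding_windows` and the toy stream are assumptions made for illustration and do not come from the slides.

```python
from collections import deque

def sliding_windows(stream, w):
    """Yield the current window after each arrival, once w values have been seen."""
    window = deque(maxlen=w)  # appending to a full deque drops the oldest value
    for value in stream:
        window.append(value)
        if len(window) == w:
            yield list(window)

# Toy stream with w = 3: each slide inserts one value and removes the oldest.
windows = list(sliding_windows([1, 2, 3, 4, 5], w=3))
```

Here `windows` holds one list per slide, each containing only the most recent `w` values.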

Preliminary
⚫Streaming time-series $t$
- $t = (t[1], t[2], \cdots)$
⚫Subsequence
- $s_p = (t[p], t[p+1], \cdots, t[p+l-1])$
⚫Pearson correlation
- $\rho(s_p, s_q) = \dfrac{\sum_{i=0}^{l-1} t[p+i]\, t[q+i] - l \mu_p \mu_q}{l \sigma_p \sigma_q}$
- $\mu_p$: mean of $s_p$; $\sigma_p$: standard deviation of $s_p$
- Relationship with the z-normalized Euclidean distance [1]: $d(\hat{s}_p, \hat{s}_q) = \sqrt{2l\,(1 - \rho(s_p, s_q))}$
[1] Mueen, A.: Time series join on subsequence correlation, ICDM (2014)
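
The relationship between the Pearson correlation and the z-normalized Euclidean distance can be checked numerically. The following is a minimal sketch, assuming the population standard deviation (division by $l$) and non-constant subsequences; the helper names and the sample values are illustrative only.

```python
import math

def mean_std(s):
    """Mean and population standard deviation of a sequence."""
    mu = sum(s) / len(s)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in s) / len(s))
    return mu, sigma

def pearson(sp, sq):
    """Pearson correlation, written as on the slide."""
    l = len(sp)
    mu_p, sigma_p = mean_std(sp)
    mu_q, sigma_q = mean_std(sq)
    dot = sum(a * b for a, b in zip(sp, sq))
    return (dot - l * mu_p * mu_q) / (l * sigma_p * sigma_q)

def znorm(s):
    """Z-normalize a sequence (assumes it is not constant)."""
    mu, sigma = mean_std(s)
    return [(x - mu) / sigma for x in s]

def znorm_distance(sp, sq):
    """Euclidean distance between the z-normalized sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(znorm(sp), znorm(sq))))

sp = [1.0, 3.0, 2.0, 5.0, 4.0]
sq = [2.0, 2.5, 4.0, 6.0, 5.0]
rho = pearson(sp, sq)
d = znorm_distance(sp, sq)
# d should equal sqrt(2 * l * (1 - rho)) up to floating-point error.
matches = math.isclose(d, math.sqrt(2 * len(sp) * (1 - rho)))
```

Because the two quantities are tied together in this way, a correlation threshold can be turned into a distance threshold, which is what the following slides rely on.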

Preliminary
⚫Similar subsequence
- $s_p$ ($s_q$) is similar to $s_q$ ($s_p$) if $\rho(s_p, s_q) \geq \theta \iff d(\hat{s}_p, \hat{s}_q) \leq \sqrt{2l\,(1 - \theta)}$
⚫Score
- $score(s_p)$ is the number of subsequences similar to $s_p$.
- Example: if $\rho(s_p, s_q) \geq \theta$ and $\rho(s_p, s_r) \geq \theta$, then $score(s_p) = 2$.
⚫Range motif $s^*$ [2]
- Subsequence with the highest score
[2] Patel, P.: Mining motifs in massive time series databases, ICDM (2003)
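
The definitions of score and range motif can be made concrete with a brute-force sketch over a static series. It assumes non-constant subsequences and counts every other subsequence as a potential match (overlapping ones included), since the slides do not say how overlaps are handled; `all_scores` and the toy series are illustrative names and data.

```python
import math

def pearson(sp, sq):
    """Pearson correlation of two equal-length, non-constant sequences."""
    l = len(sp)
    mu_p, mu_q = sum(sp) / l, sum(sq) / l
    sigma_p = math.sqrt(sum((x - mu_p) ** 2 for x in sp) / l)
    sigma_q = math.sqrt(sum((x - mu_q) ** 2 for x in sq) / l)
    dot = sum(a * b for a, b in zip(sp, sq))
    return (dot - l * mu_p * mu_q) / (l * sigma_p * sigma_q)

def all_scores(t, l, theta):
    """score(s_p) for every subsequence of length l, by comparing all pairs."""
    subs = [t[p:p + l] for p in range(len(t) - l + 1)]
    scores = [0] * len(subs)
    for p in range(len(subs)):
        for q in range(p + 1, len(subs)):
            if pearson(subs[p], subs[q]) >= theta:
                scores[p] += 1  # similarity is symmetric,
                scores[q] += 1  # so both scores grow
    return scores

t = [0, 1, 0, 1, 0, 1, 0, 3]
scores = all_scores(t, l=3, theta=0.9)
# The range motif is a subsequence with the highest score (first one on ties).
motif_start = max(range(len(scores)), key=scores.__getitem__)
```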

Baseline algorithm
⚫When the window slides (a new value is observed and the oldest value is deleted), the baseline algorithm computes the Pearson correlation between every subsequence in the window and both the expired subsequence and the new subsequence, and updates the scores of all subsequences accordingly.
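
One slide of the baseline can be sketched as follows. This is an illustrative reading of the slide, not the authors' implementation: the window and scores are kept in plain lists, the function name `slide` is hypothetical, and subsequences are assumed to be non-constant.

```python
import math

def pearson(sp, sq):
    """Pearson correlation of two equal-length, non-constant sequences."""
    l = len(sp)
    mu_p, mu_q = sum(sp) / l, sum(sq) / l
    sigma_p = math.sqrt(sum((x - mu_p) ** 2 for x in sp) / l)
    sigma_q = math.sqrt(sum((x - mu_q) ** 2 for x in sq) / l)
    dot = sum(a * b for a, b in zip(sp, sq))
    return (dot - l * mu_p * mu_q) / (l * sigma_p * sigma_q)

def slide(window, scores, new_value, l, theta):
    """Update window and scores in place for one slide of the window."""
    n = len(scores)  # number of subsequences in the window
    expired = window[:l]
    # Subsequences similar to the expired one lose one point each.
    for p in range(1, n):
        if pearson(expired, window[p:p + l]) >= theta:
            scores[p] -= 1
    window.pop(0)
    scores.pop(0)
    window.append(new_value)
    new = window[-l:]
    new_score = 0
    # Subsequences similar to the new one gain one point each.
    for p in range(n - 1):
        if pearson(window[p:p + l], new) >= theta:
            scores[p] += 1
            new_score += 1
    scores.append(new_score)

window = [0, 1, 0, 1, 0, 1]
scores = [1, 1, 1, 1]  # scores of the four subsequences for l = 3, theta = 0.9
slide(window, scores, 0, l=3, theta=0.9)
after_first = list(scores)
slide(window, scores, 3, l=3, theta=0.9)
after_second = list(scores)
```

Each slide runs two passes over the subsequences in the window, and every pass computes one full-length correlation per subsequence.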

Problem & Research goal
⚫Problem
- Computing one Pearson correlation takes $O(l)$ time.
- The correlation is computed $w - l$ times per slide.
- The time complexity of the baseline algorithm is therefore $O((w - l)\,l)$.
⚫Research goal
- Speed up the score update when the window slides, so that the motif is monitored efficiently.
- We propose the algorithm "SRMM" (Streaming Range Motif Monitoring).

SRMM (new subsequence $s_n$) - Overview 1
⚫SRMM reduces the dimensionality of the subsequences in the window with PAA [3] and maintains the reduced subsequences in a kd-tree.
- PAA: each subsequence of length $l$ is reduced to length $\phi$.
- Mapping trick: each reduced subsequence is treated as a point in a $\phi$-dimensional space.
[3] Keogh, E.: Dimensionality reduction for fast similarity search in large time series databases, KIS (2002)

SRMM (new subsequence $s_n$) - Overview 2
⚫If we can quickly determine that $score(s_n) < score(s^*)$, we can efficiently monitor the exact motif.
- We propose a technique that efficiently obtains $score_{ub}(s_n)$, an upper bound of $score(s_n)$.
- It prunes unnecessary exact score computations.
⚫Flow of SRMM
- $s_n$ (length $l$) → PAA → $s_n^{\phi}$ (length $\phi$) → insert into a kd-tree → range search → get $score_{ub}(s_n)$

SRMM (new subsequence $s_n$) - PAA
⚫PAA (Piecewise Aggregate Approximation)
- A dimensionality reduction algorithm
- Split a time-series into segments and take the mean of the values in each segment.
[Figure: a time-series before and after the transformation; the mean is computed for each segment]
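
PAA itself is only a few lines. The sketch below assumes that the number of segments $\phi$ divides the subsequence length evenly; the function name `paa` is illustrative.

```python
def paa(s, phi):
    """Reduce s to phi values by averaging equal-length segments.

    Assumes len(s) is a multiple of phi.
    """
    seg = len(s) // phi
    return [sum(s[i * seg:(i + 1) * seg]) / seg for i in range(phi)]

reduced = paa([1, 2, 3, 4, 5, 6, 7, 8], phi=4)
# reduced == [1.5, 3.5, 5.5, 7.5]
```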

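SRMM (new subsequence $s_n$) - PAA
⚫To prune exact distance computations, we use PAA.
- Use the property that the distance between transformed subsequences becomes smaller.
- PAA maps $s_p$ and $s_q$ (length $l$) to $s_p^{\phi}$ and $s_q^{\phi}$ (length $\phi$).
- $d(s_p, s_q)$ costs $O(l)$, whereas $d(s_p^{\phi}, s_q^{\phi})$ costs $O(\phi)$.
- $d(s_p, s_q) \geq d(s_p^{\phi}, s_q^{\phi}) \geq \sqrt{2\phi\,(1 - \theta)}$
- We know that $s_p$ and $s_q$ are not similar in $O(\phi)$.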
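
The pruning test can be sketched as below. It assumes that PAA is applied to the z-normalized subsequences and uses the PAA lower bound from [3] in its usual scaled form, $\sqrt{l/\phi}\; d(s_p^{\phi}, s_q^{\phi}) \leq d(\hat{s}_p, \hat{s}_q)$, which the slide does not spell out. All function names are illustrative, and $\phi$ is assumed to divide $l$.

```python
import math

def znorm(s):
    """Z-normalize a non-constant sequence (population standard deviation)."""
    mu = sum(s) / len(s)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in s) / len(s))
    return [(x - mu) / sigma for x in s]

def paa(s, phi):
    """Average equal-length segments; assumes len(s) is a multiple of phi."""
    seg = len(s) // phi
    return [sum(s[i * seg:(i + 1) * seg]) / seg for i in range(phi)]

def reduced_distance(sp, sq, phi):
    """Distance between the PAA representations, computed in O(phi) once reduced."""
    return math.dist(paa(znorm(sp), phi), paa(znorm(sq), phi))

def may_be_similar(sp, sq, phi, theta):
    """False means the pair certainly has correlation below theta."""
    return reduced_distance(sp, sq, phi) <= math.sqrt(2 * phi * (1 - theta))

a = [1, 3, 2, 5, 4, 6, 8, 7]
b = [2, 1, 4, 3, 6, 5, 7, 9]
full = math.dist(znorm(a), znorm(b))                        # exact, O(l)
bound = math.sqrt(len(a) / 4) * reduced_distance(a, b, 4)  # lower bound on full

up = [0, 0, 1, 1, 0, 0, 1, 1]
down = [1, 1, 0, 0, 1, 1, 0, 0]
pruned = not may_be_similar(up, down, phi=4, theta=0.9)     # anti-correlated pair
```

A pair that fails `may_be_similar` never needs the $O(l)$ computation; a pair that passes is only a candidate and still has to be verified exactly.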

SRMM (new subsequence $s_n$) - Mapping trick
⚫A subsequence of length $\phi$ can be regarded as a point in a $\phi$-dimensional space.
- Subsequences with a large Pearson correlation are close to each other in the $\phi$-dimensional space.
- All candidates for subsequences similar to $s_n$ lie within distance $\sqrt{2\phi\,(1 - \theta)}$ of $s_n^{\phi}$ in the $\phi$-dimensional space.
- By range search, we can get $score_{ub}$ and the candidates for similar subsequences.

SRMM (new subsequence $s_n$) - kd-tree
⚫Maintain the transformed subsequences in a kd-tree
- Range search using a kd-tree is fast (logarithmic order).
- The number of subsequences in the range is $score_{ub}(s_n)$.
- Without a kd-tree: $O(\phi\,(w - l))$
- Range search using a kd-tree: $O(\phi \log (w - l))$
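
A range count on a kd-tree can be sketched in pure Python as below. This is a simplified illustration rather than the structure used in SRMM: the tree is built by plain insertion without rebalancing or deletion, so it does not by itself guarantee the logarithmic cost quoted on the slide, and the names `KDNode`, `insert`, and `range_count` are hypothetical.

```python
import math

class KDNode:
    def __init__(self, point, axis):
        self.point = point
        self.axis = axis  # coordinate used to split at this node
        self.left = None
        self.right = None

def insert(node, point, depth=0):
    """Insert a point and return the (possibly new) subtree root."""
    if node is None:
        return KDNode(point, depth % len(point))
    if point[node.axis] < node.point[node.axis]:
        node.left = insert(node.left, point, depth + 1)
    else:
        node.right = insert(node.right, point, depth + 1)
    return node

def range_count(node, center, radius):
    """Count stored points within `radius` of `center`."""
    if node is None:
        return 0
    count = 1 if math.dist(node.point, center) <= radius else 0
    diff = center[node.axis] - node.point[node.axis]
    if diff - radius < 0:    # the ball reaches the side holding smaller coordinates
        count += range_count(node.left, center, radius)
    if diff + radius >= 0:   # the ball reaches the side holding larger or equal ones
        count += range_count(node.right, center, radius)
    return count

points = [(0, 0), (1, 1), (2, 2), (0.5, 0.2), (5, 5), (-1, 0)]
root = None
for pt in points:
    root = insert(root, pt)
in_range = range_count(root, (0, 0), 1.5)
```

If the query point is itself stored in the tree, as `(0, 0)` is here, it is counted too, so an upper bound on the score would subtract one for the subsequence itself.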

SRMM (new subsequence $s_n$) - Pruning
⚫Compare $score_{ub}(s_n)$ with $score(s^*)$
- If $score_{ub}(s_n) < score(s^*)$, then $s_n$ cannot be a motif, so we can safely prune the computation of $score(s_n)$.
- If $score_{ub}(s_n) \geq score(s^*)$, then $s_n$ can be a motif, so we must compute the exact $score(s_n)$.
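
The pruning rule amounts to a filter-and-refine step. In the sketch below, `candidates` stands for the result of the range search and `is_similar` for the exact $O(l)$ similarity check; both are placeholders supplied by the caller, and the parity test used in the example is only a stand-in for a real check.

```python
def score_new_subsequence(candidates, best_score, is_similar):
    """Return the exact score of s_n, or None when it is pruned."""
    score_ub = len(candidates)  # upper bound of score(s_n)
    if score_ub < best_score:
        return None  # s_n cannot be the motif, so skip the exact checks
    return sum(1 for c in candidates if is_similar(c))

calls = []

def is_similar(candidate):
    """Stand-in for the exact check; records how often it is invoked."""
    calls.append(candidate)
    return candidate % 2 == 0

pruned = score_new_subsequence([1, 2, 3], best_score=5, is_similar=is_similar)
calls_when_pruned = len(calls)
exact = score_new_subsequence([1, 2, 3, 4, 6], best_score=5, is_similar=is_similar)
```

In the first call the bound of 3 is below the current best score of 5, so no exact check runs; in the second the bound reaches 5 and every candidate is verified.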

SRMM (expired subsequence $s_e$)
⚫When the window slides, the scores of the subsequences that are similar to $s_e$ decrease.
- Each subsequence has a list of its similar subsequences.
- The list of $s_e$ identifies the subsequences whose scores decrease.
- Example: the list of $s_e$ contains $s_p$, $s_q$, and $s_r$; $s_e$ is deleted from the lists of $s_p$, $s_q$, and $s_r$, and $s_e$ itself is deleted.
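
Handling the expired subsequence with similar-subsequence lists can be sketched as follows. The dictionary layout, the string identifiers, and the function name `expire` are assumptions for illustration, and the lists are assumed to be complete for every subsequence.

```python
def expire(e, similar, scores):
    """Remove subsequence e and decrement the score of each subsequence similar to it."""
    for p in similar.pop(e, set()):
        similar[p].discard(e)  # e no longer appears in p's list
        scores[p] -= 1
    scores.pop(e, None)

# s_e is similar to s_p, s_q, and s_r; s_p and s_q are also similar to each other.
similar = {
    "e": {"p", "q", "r"},
    "p": {"e", "q"},
    "q": {"e", "p"},
    "r": {"e"},
}
scores = {name: len(neighbours) for name, neighbours in similar.items()}
expire("e", similar, scores)
```

Only the subsequences listed for `e` are touched, so no correlation has to be recomputed when a subsequence expires.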

Experiment
⚫Dataset
- GreenHouseGas
- RefrigerationDevices
⚫Parameters
- Window-size, $w$ [×10^3]: 5, 10, 15, 20
- Motif length, $l$: 50, 100, 150, 200
- Threshold, $\theta$: 0.75, 0.8, 0.85, 0.9, 0.95
⚫Comparative algorithm
- Baseline algorithm
⚫Evaluation criterion
- Update time: average time to update a motif per window slide

Result – Impact of Window-size $w$
[Figure: update time [msec] versus $w$ [×10^3] for Baseline and SRMM; panels: GreenHouseGas, RefrigerationDevices]
⚫SRMM is faster than Baseline.

Result – Impact of Motif length $l$
[Figure: update time [msec] versus $l$ for Baseline and SRMM; panels: GreenHouseGas, RefrigerationDevices]
⚫SRMM is not affected by $l$.

Result – Impact of Threshold $\theta$
[Figure: update time [msec] versus $\theta$ for Baseline and SRMM; panels: GreenHouseGas, RefrigerationDevices]
⚫SRMM is faster as $\theta$ increases.

Conclusion
⚫We have proposed SRMM, an efficient algorithm for monitoring a range motif.
- By using PAA and a kd-tree, unnecessary score computations are reduced.
⚫The results of our experiments show its efficiency and scalability.
