Periodicity Detection Method For Small-Sample Time Series Datasets
Open Access
Full open access to this and thousands of other papers at http://www.la-press.com.
Methodology
Daisuke Tominaga
Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology, Aomi 2-4-7,
Koto, Tokyo, 135-0064, Japan. Corresponding author email: tominaga@cbrc.jp
Abstract: Time series of gene expression often exhibit periodic behavior under the influence of multiple signal pathways, and are
represented by a model that incorporates multiple harmonics and noise. Most of these data, which are observed using DNA microarrays,
consist of few sampling points in time, but most periodicity detection methods require a relatively large number of sampling points.
We have previously developed a detection algorithm based on the discrete Fourier transform and Akaike’s information criterion. Here
we demonstrate the performance of the algorithm for small-sample time series data through a comparison with conventional and newly
proposed periodicity detection methods based on a statistical analysis of the power of harmonics.
We show that this method has higher sensitivity for data consisting of multiple harmonics, and is more robust against noise than other
methods. Although “combinatorial explosion” occurs for large datasets, the computational time is not a problem for small-sample datasets.
The MATLAB/GNU Octave script of the algorithm is available on the author’s web site: http://www.cbrc.jp/%7Etominaga/piccolo/.
Keywords: periodicity detection, gene expression time series, information criterion, discrete Fourier transform, circadian rhythm
doi: 10.4137/BBI.S5983
This is an open access article. Unrestricted non-commercial use is permitted provided the original work is properly cited.
a period of interest for DNA microarray datasets, thus satisfying the expectations for biological data.
method and Dixon’s Q test method are applied to logarithms of powers.
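As a rough illustration of applying Dixon’s Q test to logarithms of powers, the sketch below computes the log power of each harmonic in one half of the spectrum and then the simplest form of Dixon’s ratio for the largest value. The function names, the use of the r10 form of the ratio, and the caller-supplied `critical` argument are assumptions made for illustration, not the author’s script; critical values would have to be taken from published tables such as those in reference 17.

```python
import cmath
import math


def log_powers(x):
    """Natural log of the DFT power for harmonics k = 1 .. floor(n/2).

    Only one half of the two-sided spectrum is used, because the other half
    is redundant for real-valued data. Assumes every power is strictly positive.
    """
    n = len(x)
    logs = []
    for k in range(1, n // 2 + 1):
        coeff = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        logs.append(math.log(abs(coeff) ** 2))
    return logs


def dixon_q_largest(values):
    """Dixon's Q (r10 form) for the largest value: gap to its neighbour / range."""
    ordered = sorted(values)
    spread = ordered[-1] - ordered[0]
    if len(ordered) < 3 or spread == 0.0:
        raise ValueError("need at least three values that are not all equal")
    return (ordered[-1] - ordered[-2]) / spread


def is_outlier(values, critical):
    """True when Q exceeds a caller-supplied critical value (hypothetical helper)."""
    return dixon_q_largest(values) > critical
```

A harmonic would then be reported as significant when `is_outlier(log_powers(series), critical)` holds and the largest log power belongs to the period of interest.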
Table 1. P-values and standard deviations (sd) of the normality test for the distribution of powers and logarithms of powers
of time series data in simulation datasets for the one-harmonic and two-harmonic conditions.
Table 2. P-values and their standard deviations (sd) for the normality test of the distribution of powers and logarithms of
powers of time series data in datasets taken from the Gene Expression Omnibus database.
Figure 1. Histograms of logarithms of powers for all twelve DNA microarray datasets used for the performance comparison in the Result section. The x axes are
the bins of the histograms. Each bin is a range of natural logarithms of powers of the time series in each dataset. The y axes are the frequencies of logarithms of
powers in each range. The sum of all frequencies is the same as the number of probes in each dataset.
to be normal. This test, which ignores redundant information from half of a two-sided power spectrum, is used to detect outliers from a set of logarithms of power spectra. The criterion of the test is a critical value of 0.95, one-sided.17
Ahdesmäki’s method
Ahdesmäki’s method12 uses the kernel density estimation18 of the distribution of the square root of the targeted harmonic’s power (proportional to the logarithm of power). The distribution is approximated by shuffling the order of samples in the time series data and calculating the power of the targeted harmonic by least-squares fitting. We use Yi Cao’s ‘gkde’ kernel density estimation method*1 to calculate the approximate probability density function (PDF) of powers of the harmonics, and we use the built-in function ‘ols’ in GNU Octave version 3.2.3*2 to calculate the power of the harmonic.
The criterion of the test is a critical value of 0.95, one-sided.
*1 http://www.mathworks.com/matlabcentral/fileexchange/19160
*2 http://www.gnu.org/software/octave/doc/interpreter/Linear-Least-Squares.html
The piccolo method
The ‘piccolo’ algorithm9 is an exhaustive search for the optimal combination of Fourier coefficients calculated by DFT from given time series data. The algorithm searches all possible subsets of conjugate pairs of Fourier coefficients, but the search range for the size of the subsets is limited to keep the information criterion value (AIC, BIC, etc.) reliable.4
Our previously presented version of the method incorporates BIC (Bayesian Information Criterion) as the information criterion. Here we introduce AIC (Akaike’s Information Criterion) instead of BIC to improve detection performance. The previous version is called ‘piccolo/B’ in this paper; ‘piccolo’ refers to the new AIC version.
The optimal subset is defined such that the AIC value calculated from the subset and the given data is minimal. AIC is used as the information criterion under the assumptions that the error distribution of the datum at each time point is normal and that its variance is the same as the variance among the time series data.4
The model is a subset of the set of Fourier coefficients obtained by DFT from the given time series data. The number of Fourier coefficients is n when n is the number of samples in the time series; however, half of these coefficients are complex conjugates of the other half. A Fourier coefficient must always be selected with its conjugate. This allows the inverse DFT of the model to be real numbers, which is necessary to calculate the AIC. Thus, the number of model parameters is the number of coefficient pairs. When the data length is even, the coefficient corresponding to the Nyquist frequency is a pure real number and its complex conjugate does not appear in the set of Fourier coefficients in the model. This coefficient does not form a pair when it is chosen.
The AIC value is calculated using the following equation:4
$$\mathrm{AIC} = n \log(2\pi) + n \log(\sigma^2) + n + 2p, \qquad (1)$$
where n is the number of samples (data length of the time series), σ² is the variance of errors between the given time series data and the time series calculated from the model by inverse DFT, and p is the number of parameters (pairs of Fourier coefficients) in the model.
In the piccolo method, the Fourier coefficients in the subset that minimizes the AIC value are taken to be significant constituents representing the given data. Accordingly, periods corresponding to these Fourier coefficients are considered significant, and the given time series data are judged to be periodic with periods corresponding to these Fourier coefficients. Thus, multiple periods can be found simultaneously even if their powers are close to each other.
Result
Fourteen datasets are used to compare the five methods for periodicity detection, comprising two simulated datasets and twelve DNA microarray datasets taken from an online database.
Robustness against noise
Data
We tested the robustness against noise of the five periodicity detection methods, namely the quantile method, Dixon’s Q test, Ahdesmäki’s method, the piccolo/B and the piccolo method, using simulation
data consisting of one or two harmonic signals and log-normal noise.
Considering that the distribution of DNA microarray data is log-normal,19 each datum is created as a sum of a log-normally distributed random number and the value of one harmonic (the one-harmonic condition), or two harmonics (the two-harmonic condition).
For both conditions, 15 datasets are generated by changing the signal-to-noise ratio as follows:
$$\log N(0,1) + A_i \cos\left(\frac{2\pi}{24} t + C_i\right) + B_i \cos\left(\frac{2\pi}{16} t + D_i\right)$$
where logN(0,1) is log-normally distributed random noise whose mean is 0 and variance is 1, i (i = 1, …, 500) is the index of the time series, and A_i and B_i are the amplitudes of the harmonic signals whose periods are 24 hours and 16 hours, respectively (B_i = 0 for the one-harmonic condition).
In the two-harmonic condition, the simulation data consist of two signals (16-hour and 24-hour harmonics) and noise; however, the three detection methods other than piccolo and piccolo/B can, in principle, hardly detect multiple signals simultaneously. Therefore, we tested the five methods on detection of the 24-hour signal.
Plots of the number of detected time series in each dataset are shown in Figure 2. For both the one-harmonic and two-harmonic conditions, the piccolo method achieved a high detection rate, especially for noisy (low RSN) data.
[Figure 2. y axis: number of detected time series out of 500.]
The detection performance was relatively lower at RSN = 1.0 in the one-harmonic condition except for the piccolo method. In this dataset, the variance of the signal and noise is the same; thus, the signal and
noise are difficult to distinguish, especially for small-sample time series data.
The number of detected time series in the two-harmonic condition is lower than that in the one-harmonic condition for RSN > 0.01. The difference between the one-harmonic condition and the two-harmonic condition is smaller for the piccolo method than for the other methods.
Detection of circadian rhythm
Data
The five detection methods are applied to experimentally observed DNA microarray data taken from the Gene Expression Omnibus online database by NCBI, NIH,10 to detect genes (probes) which have 24-hour periodicity, or ‘circadian rhythm’.
The P-values obtained by the Kolmogorov-Smirnov test for the normality of the distribution of powers and logarithms of powers are shown in Table 2. P-values are calculated for the time series data in each dataset, and the means and standard deviations of the P-values are listed in the table. For both powers and logarithms of powers, the null hypothesis (the distribution is normal) cannot be rejected at the 95% confidence level. Although the sample sizes are small (6 to 12), it can be said that Dixon’s Q test cannot be considered inappropriate.
It is not defined whether or not the time series in the datasets are circadian; however, some of them are labeled with the GO term20 ‘circadian rhythm’. Here, detection performance is evaluated in terms of the total number of detected probes and the number of detected probes labeled ‘circadian rhythm’ for each dataset. The quantile method cannot be used on datasets in which the data length of each time series is 7 or less.
Biological description of datasets
All twelve DNA microarray datasets are time series observations intended for the analysis of circadian rhythm. GDS1629 is a set of forty five samples of an immortalized suprachiasmatic nucleus cell line of normal rat for 42 hours, every 6 hours (eight time points). The dataset contains five or six samples for each time point. We only use one of them, whose sample ID is the largest. GDS2110 is a set of six samples of normal Macaca mulatta adult female adrenal glands for 20 hours, every 4 hours (six time points). GDS2232 is a set of twenty four samples of normal mouse adrenal glands for 44 hours, every 4 hours (twelve time points). The dataset contains two samples for each time point. We only use one of them, which appears earlier in the published data file. GDS404 is a set of thirteen samples of normal mouse aortae for 44 hours, every 4 hours (twelve time points). The dataset contains two samples for the first time point. We only use one of them, which appears earlier in the published data file. GSE3424 is a set of eight samples of normal Arabidopsis thaliana for 20 hours, every 4 hours (six time points). The dataset contains two samples for two time points (0-hour and 12-hour). We only use one of them, which appears earlier in the published data file. GSE6542 is a set of forty eight samples of three mutants of Drosophila melanogaster in two experimental conditions (seven conditions in total). We divide it into seven sub-datasets here. Six sub-datasets consist of six time points and one consists of twelve points. All these datasets are normalized by publishers for further analysis.
Data of duplicate probes for the same gene and data of probes which contain a numerically invalid value are ignored for this performance comparison.
Detection performance
The detection results are shown in Table 3. For both the total number of detected probes and the number of detected probes labeled circadian, the piccolo method is superior to the other four methods, including the previous version of piccolo (piccolo/B), for all datasets. Ratios of S in Table 3, which is the number of probes detected by the piccolo method but not by the other four methods, to the number of total probes in each dataset range from 0.333 (GDS1629) to 0.776 (GSE6542_3). This means that using the piccolo method we find that 77.6% of all probes in GSE6542_3 are under the influence of circadian oscillation mechanisms, but the other four methods cannot detect these probes.
On the other hand, ratios of the numbers of probes detected by one or more of the other four methods but not detected by the piccolo method to the number of total probes in each dataset are in the range of 0.0 (GSE6542_2, GSE6542_4, GSE6542_6) to 0.0418 (GDS404), or less than 5% (data not shown).
Table 3. Results of detection of circadian oscillation on the twelve DNA microarray datasets. Numbers before and after a
slash are the number of detected probes and detected circadian annotated probes respectively. The annotated probes are
labeled with the GO term ‘circadian rhythm’ in the chip definition files of the microarrays.
Dataset     C    Quantile    Q test      Ahdesmäki   Piccolo/B   Piccolo       S
GDS1629     22   146 / 1     60 / 0      121 / 0     163 / 1     2231 / 7      1981
GDS2110     26   –           667 / 0     457 / 2     0 / 0       10658 / 16    9745
GDS2232     37   4005 / 1    3053 / 3    5343 / 4    9118 / 11   23057 / 28    10892
GSE3424     33   –           2853 / 1    1436 / 2    0 / 0       19233 / 29    15837
GDS404      12   714 / 2     497 / 2     655 / 3     752 / 3     4044 / 6      2670
GSE6542_1   28   –           529 / 0     401 / 1     0 / 0       8554 / 23     7829
GSE6542_2   28   –           413 / 0     339 / 2     0 / 0       7706 / 23     7118
GSE6542_3   28   –           720 / 0     340 / 1     0 / 0       9939 / 23     9099
GSE6542_4   28   –           495 / 0     361 / 0     0 / 0       8924 / 23     8235
GSE6542_5   28   799 / 3     623 / 0     656 / 5     1046 / 4    6038 / 19     4335
GSE6542_6   28   –           534 / 0     378 / 1     0 / 0       8513 / 20     7800
GSE6542_7   28   –           501 / 0     403 / 0     0 / 0       8236 / 19     7517
Notes: C is the number of circadian probes in the chip used for each dataset (duplicate probes for each gene and probes containing invalid numerical data
are omitted). S is the number of probes detected by the piccolo method and by none of the other four methods.
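The piccolo search behind the Piccolo column can be sketched as follows: an exhaustive search over subsets of conjugate pairs of Fourier coefficients that minimizes the AIC of equation (1). This is a minimal sketch under stated assumptions, not the author’s MATLAB/GNU Octave script: the function names, the removal of the mean before the DFT, the inclusion of the empty model as the non-periodic case, and the `max_pairs` limit on subset size are all hypothetical choices made for illustration.

```python
import cmath
import math
from itertools import combinations


def dft(x):
    """Plain O(n^2) discrete Fourier transform; adequate for short series."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def reconstruct(coeffs, ks, n):
    """Inverse DFT using only the chosen coefficients and their conjugates."""
    y = [0.0] * n
    for k in ks:
        # The Nyquist coefficient (n even, k == n/2) has no separate conjugate.
        weight = 1.0 if 2 * k == n else 2.0
        for t in range(n):
            term = coeffs[k] * cmath.exp(2j * math.pi * k * t / n)
            y[t] += weight * term.real / n
    return y


def aic(x, y, p):
    """Equation (1): n log(2 pi) + n log(sigma^2) + n + 2p."""
    n = len(x)
    sigma2 = sum((a - b) ** 2 for a, b in zip(x, y)) / n
    if sigma2 <= 0.0:
        return float("-inf")  # perfect fit; avoids log(0)
    return n * math.log(2 * math.pi) + n * math.log(sigma2) + n + 2 * p


def piccolo_sketch(x, dt, max_pairs):
    """Return (harmonic indices, periods, AIC) of the AIC-minimal subset.

    dt is the sampling interval; max_pairs caps the subset size (assumption).
    """
    n = len(x)
    mean = sum(x) / n
    xc = [v - mean for v in x]  # assumption: the mean is removed beforehand
    coeffs = dft(xc)
    best_ks, best_aic = (), aic(xc, [0.0] * n, 0)  # empty model: not periodic
    for p in range(1, max_pairs + 1):
        for ks in combinations(range(1, n // 2 + 1), p):
            score = aic(xc, reconstruct(coeffs, ks, n), p)
            if score < best_aic:
                best_ks, best_aic = ks, score
    periods = [n * dt / k for k in best_ks]
    return best_ks, periods, best_aic
```

For a series of twelve points sampled every 4 hours, harmonic index 2 corresponds to a 24-hour period, so a probe would count as circadian in this sketch when 24.0 appears in the returned periods. The number of subsets grows combinatorially with n, which is why a cap such as `max_pairs` matters for longer series.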
Figure 3. Plots of time series data that are detected only by the piccolo method and not by the other four methods. For each dataset, the time series
of the probe with the largest ratio between the maximum power and the second largest power is plotted. Ranges of sampling time points
differ between datasets. Datasets and their time ranges are: top (left to right): GDS1629 (44 h), GDS2110 (20 h), GDS2232 (44 h); second (left to right):
GDS404 (44 h), GSE3424 (20 h), GSE6542_1 (20 h); third (left to right): GSE6542_2 (20 h), GSE6542_3 (20 h), GSE6542_4 (20 h); bottom (left to
right): GSE6542_5 (44 h), GSE6542_6 (20 h), GSE6542_7 (20 h).
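The ratio between the maximum power and the second largest power, used above to choose which probe to plot, can be computed along the following lines. This is a small sketch with hypothetical function names; it takes the power of a harmonic to be the squared magnitude of its DFT coefficient, and the ratio does not depend on how the power is normalized.

```python
import cmath
import math


def harmonic_powers(x):
    """Map harmonic index k (1 .. floor(n/2)) to the squared DFT magnitude."""
    n = len(x)
    powers = {}
    for k in range(1, n // 2 + 1):
        coeff = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        powers[k] = abs(coeff) ** 2
    return powers


def peak_ratio(x):
    """Largest power divided by the second largest (needs at least 4 samples)."""
    ranked = sorted(harmonic_powers(x).values(), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two harmonics")
    if ranked[1] == 0.0:
        return float("inf")
    return ranked[0] / ranked[1]
```

Within a dataset, the plotted probe would then be the one that maximizes `peak_ratio` among the probes detected only by the piccolo method.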
[Figure: computational time (s) of the piccolo, Ahdesmäki, 1.5 IQR, and Q test methods.]
Discussion
Five methods for periodicity detection, namely, two simple methods (the quantile method and Dixon’s Q
A comparison of the five methods using simulation data shows that the piccolo method is the most robust against noise. The detection performance of the methods, except the piccolo method, was worse for the two-harmonic data than for the one-harmonic data. The piccolo method exhibited more consistent performance between datasets than the other methods. This suggests that the piccolo method has high detection performance for data with multiple periodicities.
The computational cost of the piccolo method represents a potential problem for large datasets. In future work, we will attempt to reduce the computational cost by introducing the branch and bound method to the exhaustive search for the combination of Fourier coefficients.
Acknowledgement
We wish to thank Drs. Wataru Fujibuchi and Sachiyo Aburatani of the CBRC, AIST, for fruitful discussions.
Disclosure
This manuscript has been read and approved by the author. This paper is unique and is not under consideration by any other publication and has not been published elsewhere. The author and peer reviewers of this paper report no conflicts of interest. The author confirms that they have permission to reproduce any copyrighted material.
References
1. Ernst J, Bar-Joseph Z. STEM: a tool for the analysis of short time series gene expression data. BMC Bioinformatics. 2006;7:191.
2. McQuarrie ADR, Tsai CL. Regression and Time Series Model Selection. World Scientific; 1998.
3. Artis M, Hoffmann M, Nachane D, Toro J. The detection of hidden periodicities: A comparison of alternative methods. EUI Working Paper ECO. 2004;10.
4. Sakamoto Y, Ishiguro K, Kitagawa G. Akaike Information Criterion Statistics. Springer Verlag; 1986.
5. Benedetto JJ, Pfander GE. Periodic wavelet transforms and periodicity detection. SIAM Journal of Applied Mathematics. 2002;62(4):1329–68.
6. Janer L, Bonet JB, Lleida-Solano E. Pitch detection and voiced/unvoiced decision algorithm based on wavelet transform. Proceedings of The Fourth International Conference on Spoken Language Processing. 1996;2(FrP2P1):1209–12.
7. Okamura H, Semba Y. A novel statistical method for validating the periodicity of vertebral growth band formation in elasmobranch fishes. Canadian Journal of Fisheries and Aquatic Sciences. 2009;66(5):771–80.
8. Yang R, Su Z. Analyzing circadian expression data by harmonic regression based on autoregressive spectral estimation. Bioinformatics. 2010;26:i168–74.
9. Tominaga D, Horimoto K. Judgment algorithm for periodicity of time series data based on Bayesian information criterion. Journal of Bioinformatics and Computational Biology. 2008;6(4):747–57.
10. Barrett T, Suzek TO, Troup DB, et al. NCBI GEO: mining millions of expression profiles-database and tools. Nucleic Acids Research. 2005;33:D562–6.
11. Parkinson H, Kapushesky M, Kolesnikov N, et al. ArrayExpress update-from an archive of functional genomics experiments to the atlas of gene expression. Nucleic Acids Research. 2009;37:D868–72.
12. Ahdesmäki M, Lähdesmäki H, Pearson R, et al. Robust detection of periodic time series measured from biological systems. BMC Bioinformatics. 2005;6:117.
13. Hogg RV, McKean JW, Craig AT. Introduction to Mathematical Statistics. 6th ed. Pearson Prentice Hall; 2005.
14. Rousseeuw PJ, Leroy AM. Robust Regression and Outlier Detection. Wiley-Interscience; 2003.
15. Dixon WJ. Analysis of extreme values. Annals of Mathematical Statistics. 1950;21:488–506.
16. Dixon WJ. Ratios involving extreme values. Annals of Mathematical Statistics. 1951;22:68–78.
17. Rorabacher DB. Statistical treatment for rejection of deviant values: critical values of Dixon’s “Q” parameter and related subrange ratios at the 95% confidence level. Analytical Chemistry. 1991;63(2):139–46.
18. Silverman BW. Density Estimation for Statistics and Data Analysis. Chapman and Hall/CRC; 1986.
19. Konishi T. Three-parameter lognormal distribution ubiquitously found in cDNA microarray data and its application to parametric data treatment. BMC Bioinformatics. 2004;5:5.
20. The Gene Ontology Consortium. Gene ontology: tool for the unification of biology. Nature Genetics. 2000;25(1):25–9.