Periodicity Detection Method For Small-Sample Time Series Datasets

This document describes a methodology for detecting periodicity in small sample time series datasets, such as those from DNA microarray data. It compares the author's previously developed piccolo method, which uses the discrete Fourier transform and Akaike's information criterion, to other conventional and newly proposed periodicity detection methods. The results show that the piccolo method has higher sensitivity for data containing multiple harmonics and is more robust against noise than the other methods. It is well-suited for analysis of small sample biological time series data.
Bioinformatics and Biology Insights

Methodology

Open Access: Full open access to this and thousands of other papers at http://www.la-press.com.

Periodicity Detection Method for Small-Sample Time Series Datasets

Daisuke Tominaga
Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology, Aomi 2-4-7,
Koto, Tokyo, 135-0064, Japan. Corresponding author email: tominaga@cbrc.jp

Abstract: Time series of gene expression often exhibit periodic behavior under the influence of multiple signal pathways, and are represented by a model that incorporates multiple harmonics and noise. Most of these data, which are observed using DNA microarrays, consist of few sampling points in time, but most periodicity detection methods require a relatively large number of sampling points. We have previously developed a detection algorithm based on the discrete Fourier transform and Akaike’s information criterion. Here we demonstrate the performance of the algorithm for small-sample time series data through a comparison with conventional and newly proposed periodicity detection methods based on a statistical analysis of the power of harmonics. We show that this method has higher sensitivity for data consisting of multiple harmonics, and is more robust against noise than other methods. Although “combinatorial explosion” occurs for large datasets, the computational time is not a problem for small-sample datasets. The MATLAB/GNU Octave script of the algorithm is available on the author’s web site: http://www.cbrc.jp/%7Etominaga/piccolo/.

Keywords: periodicity detection, gene expression time series, information criterion, discrete Fourier transform, circadian rhythm

Bioinformatics and Biology Insights 2010:4 127–136

doi: 10.4137/BBI.S5983

This article is available from http://www.la-press.com.

© the author(s), publisher and licensee Libertas Academica Ltd.

This is an open access article. Unrestricted non-commercial use is permitted provided the original work is properly cited.


Introduction

Life phenomena are observed as changes in time, and many of these phenomena, such as circadian rhythm and the cell cycle, exhibit periodic behavior. These phenomena are common to many species and are thought to be an expression of essential mechanisms of life. In addition, irregular periodicity is caused by abnormal stimuli or disorder of these mechanisms. Thus, a periodicity detection technique for time series observation data is important in many areas of biology and medicine.

Generally, the observation of life phenomena incurs certain costs and thus the number of sampling points in time is often small, as in the case of DNA microarray data.1 For this reason, a reliable method of periodicity detection is needed for small datasets.

Time series data on life phenomena can be represented by a mathematical model consisting of noise and various simple formulae, such as polynomial functions or harmonics (sinusoidal functions).2 A model of time series data of periodic phenomena should contain harmonics. If these harmonics are judged significantly large by a statistical test, the phenomena can be considered periodic.

Generally, life phenomena are the result of complex interactions of biological networks (gene regulatory networks, metabolic pathways, signal transduction networks, etc.); thus, time series data on constituents of these networks can contain multiple harmonics with different periods.

Widely used periodicity detection techniques can be classified into two categories: 1) model fitting in the time domain, and 2) statistical significance tests on power spectra.

The first category includes methods based on direct curve fitting to the observed data. When the data can be modeled by n harmonics, the number of parameters that are optimized by the fitting method is 3n + 1,3 which is too many parameters for small datasets.4

Methods in the second category are widely used in many areas of science. The basic method is to calculate the power spectra by using the discrete Fourier transform (DFT) or the autocovariance matrix, and then to test the significance of each spectrum of a harmonic by outlier detection methods.3 A simple method uses quantiles of spectra to detect outliers. Both parametric tests (such as Dixon’s Q test or Fisher’s G test) and non-parametric tests (eg, the quantile/box-plot) are used to detect outliers. These methods frequently do not detect the periodicity of interest if the significance of the period is close to that of other periods, even if the significance is high. Thus, these methods are inadequate for data with multiple periodicities. In addition, a certain number of spectrum elements are needed to make the outlier tests meaningful and robust to noise. Thus, these methods are not optimal for small-sample time series data.

Other advanced algorithms, such as wavelet-based methods5,6 and model fitting using directional statistics,7 have been proposed. However, few applications of these methods have been reported to date; therefore, their utility for the analysis of small-sample biological datasets remains an open question.

A clustering method1 and an AR (autoregression) model based periodicity detection method8 have been developed specifically for small-sample data. The first classifies genes by expression time series but does not detect period or periodicity. The second can find a period and its P-value for each time series, but does not detect multiple harmonics, ie, it does not find ‘the second significant period’.

Our previously proposed method, called ‘piccolo’,9 consists of the DFT and the Bayesian Information Criterion (BIC),2 and is not based on outlier detection. The algorithm is an exhaustive search to find the best combination of Fourier coefficients in terms of the information criterion.4 The combinatorial search does not require a long computational time for most DNA microarray time series datasets found on the web, such as the datasets in the Gene Expression Omnibus10 and ArrayExpress.11

We improve the periodicity detection performance of the piccolo method by introducing Akaike’s Information Criterion (AIC)4 instead of BIC, and demonstrate its performance through a comparison with two conventional methods, one newly developed method and the old version of our method (the BIC version of the piccolo) on two simulation datasets and twelve microarray datasets. The piccolo algorithm (new AIC version) is shown to be highly sensitive and robust against noise on simulated short time series data which consist of multiple (two) harmonic signals and noise. In addition, the present method can achieve high detection rates of a period of interest for DNA microarray datasets, thus satisfying the expectations for biological data.

Methods

We choose two widely used conventional methods, one recently proposed method and the older version of the piccolo method for comparison with the improved new piccolo method. The methods selected for the comparison, except the old version of the piccolo, are based on statistical tests on the logarithms of power spectra.

The two conventional methods are the quantile method and Dixon’s Q test. The other method is a non-parametric test for significance of the logarithms of power spectra, recently proposed by Ahdesmäki et al.12

Dixon’s Q test requires the assumption that the distribution of samples is normal. The other four methods, namely, the quantile method, Ahdesmäki’s method, and the old and new piccolo methods, do not require this assumption. In the piccolo method (both old and new), the error distribution at each data point (sampling points at various times in the time series data) is assumed to be normal and its variance is assumed to be the same as that of the data.4

According to results of statistical tests for normality of powers and logarithms of powers of each time series in all datasets (Tables 1 and 2), no conclusion can be reached regarding the distribution of powers and logarithms of powers. Note that the power of a harmonic is calculated as the product of its Fourier coefficient, which is calculated by DFT, and the complex conjugate of the coefficient; therefore, the power of a harmonic is a real value. Logarithms of powers are perhaps more suitable than the values of powers themselves for the quantile method and Dixon’s Q test, considering histograms of logarithms of powers for each dataset (Fig. 1). Accordingly, the quantile method and Dixon’s Q test method are applied to logarithms of powers.

Quantile method

Outlier detection using an interquartile range (IQR) is a basic and widely used technique in many scientific fields because it has been found empirically to be useful for outlier elimination.13,14

In the quantile method, the DFT is applied to the data, and the power of each harmonic is calculated as the product of its Fourier coefficient and its complex conjugate. Then, quantile points of the logarithms of the powers and the IQR are calculated. All logarithms of powers are compared with the outlier bound, which is the sum of the third quartile point (75th percentile point) and the IQR multiplied by 1.5 (for normally distributed samples, this corresponds to a one-sided critical value of 0.9541). If a logarithm of a power is larger than the outlier bound, the harmonic corresponding to the power is significant, and thus the given time series data is considered periodic. The periodicity of the time series is the same as that of the significant harmonics.

The quantile method requires a sample size (data length) of 8 or more for power spectra. Power spectra (real numbers) are calculated from Fourier coefficients (complex numbers), which have symmetry; thus, the number of unique samples of the spectra is half the data length. The unique sample size is (n-1)/2 for an odd data length n. If the number of samples is less than 4, the quantile method cannot detect any outliers because the bound is larger than the largest sample. Thus, this method cannot be used for time series data with data length of 7 or less.

Dixon’s Q test

Dixon’s Q test15,16 is a widely used outlier detection algorithm, in which the sample distribution is assumed

Table 1. P-values and standard deviations (sd) of the normality test for the distribution of powers and logarithms of powers of time series data in simulation datasets for the one-harmonic and two-harmonic conditions.

Condition       N    T   Int.  Power              Log of power
One harmonic    500  12  4     0.755 (sd: 0.204)  0.862 (sd: 0.175)
Two harmonics   500  12  4     0.771 (sd: 0.218)  0.855 (sd: 0.165)

Notes: Signal-to-noise ratio (RSN) is 0.1. P-values are calculated using the Kolmogorov-Smirnov test.
Abbreviations: N, number of time series data in each dataset; T, length of each time series in the dataset; Int., interval between each two samplings (h) in each time series.
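The outlier bound of the quantile method described in the Methods (third quartile plus 1.5 times the IQR, applied to logarithms of powers) can be illustrated with a minimal Python sketch. The function name `quantile_outliers`, the example values and the quartile interpolation rule (the "inclusive" method of Python's `statistics.quantiles`) are assumptions made for illustration; the author's MATLAB/GNU Octave script may compute quartiles differently.

```python
import statistics

def quantile_outliers(log_powers):
    """Return (indices of log powers above Q3 + 1.5 * IQR, the bound)."""
    # Quartiles by linear interpolation between order statistics
    # (assumed convention; the paper does not state which one is used).
    q1, _, q3 = statistics.quantiles(log_powers, n=4, method="inclusive")
    bound = q3 + 1.5 * (q3 - q1)
    hits = [i for i, value in enumerate(log_powers) if value > bound]
    return hits, bound

# Hypothetical log powers of the six unique harmonics of a 12-point series.
example = [0.1, 0.3, 0.2, 5.0, 0.4, 0.25]
significant, bound = quantile_outliers(example)
```

Under this quartile convention, a call with only three samples, such as `quantile_outliers([0.0, 0.0, 100.0])`, reports no outliers because the bound lies above the largest sample, which is consistent with the minimum sample size stated for the method.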


Table 2. P-values and their standard deviations (sd) for the normality test of the distribution of powers and logarithms of powers of time series data in datasets taken from the Gene Expression Omnibus database.

Dataset     N      T   Int.  Power               Log of power
GDS1629     6346   8   6     0.898 (sd: 0.127)   0.911 (sd: 0.112)
GDS2110     14904  6   4     0.936 (sd: 0.0712)  0.936 (sd: 0.0712)
GDS2232     29109  12  4     0.845 (sd: 0.172)   0.828 (sd: 0.182)
GSE3424     22759  6   4     0.922 (sd: 0.0751)  0.936 (sd: 0.0719)
GDS404      6484   12  4     0.871 (sd: 0.153)   0.865 (sd: 0.159)
GSE6542-1   11699  6   4     0.936 (sd: 0.0712)  0.936 (sd: 0.0712)
GSE6542-2   11699  6   4     0.936 (sd: 0.0720)  0.936 (sd: 0.0718)
GSE6542-3   11699  6   4     0.945 (sd: 0.0682)  0.944 (sd: 0.0681)
GSE6542-4   11699  6   4     0.931 (sd: 0.0735)  0.932 (sd: 0.0735)
GSE6542-5   11699  12  4     0.877 (sd: 0.145)   0.875 (sd: 0.147)
GSE6542-6   11699  6   4     0.937 (sd: 0.0714)  0.936 (sd: 0.0715)

Note: P-values are calculated using the Kolmogorov-Smirnov test.
Abbreviations: N, number of time series data in each dataset; T, length of each time series in the dataset; Int., interval between each two samplings (h) in each time series.

[Figure 1: twelve histogram panels, one per dataset: GDS1629, GDS2110, GDS2232, GDS404, GSE3424, GSE6542_1, GSE6542_2, GSE6542_3, GSE6542_4, GSE6542_5, GSE6542_6, GSE6542_7.]

Figure 1. Histograms of logarithms of powers for all twelve DNA microarray datasets used for the performance comparison in the Result section. The x axes are the bins of the histograms. Each bin is a range of natural logarithms of powers of each time series in each dataset. The y axes are the frequencies of logarithms of powers in each range. The sum of all frequencies is the same as the number of probes in each dataset.
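As a minimal sketch of how the quantities binned in Figure 1 can be obtained, the following Python code computes the power of each harmonic as the product of its Fourier coefficient and the complex conjugate, and then takes natural logarithms. The function names `dft` and `log_powers`, the exclusion of the zero-frequency term and the example series are illustrative assumptions; this is not the author's MATLAB/GNU Octave script.

```python
import cmath
import math

def dft(x):
    """Naive, unnormalized discrete Fourier transform (O(n^2))."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def log_powers(x):
    """Natural logs of the powers of the unique harmonics k = 1 .. n // 2."""
    coeffs = dft(x)
    n = len(x)
    # Power = coefficient times its complex conjugate, which is real.
    powers = [(coeffs[k] * coeffs[k].conjugate()).real
              for k in range(1, n // 2 + 1)]
    return [math.log(p) for p in powers]

# Hypothetical series: 12 points every 4 h, a 24 h cosine plus a slow drift.
series = [math.cos(2 * math.pi * (4 * t) / 24) + 0.01 * t for t in range(12)]
lp = log_powers(series)
```

For a series of 12 points sampled every 4 hours, harmonic k has a period of 48/k hours, so the 24-hour component appears at k = 2, the second entry of `lp`.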


to be normal. This test, which ignores redundant information from half of a two-sided power spectrum, is used to detect outliers from a set of logarithms of power spectra. The criterion of the test is a critical value of 0.95, one-sided.17

Ahdesmäki’s method

Ahdesmäki’s method12 uses the kernel density estimation18 of the distribution of the square root of the targeted harmonic’s power (proportional to the logarithm of power). The distribution is approximated by shuffling the order of samples in the time series data and calculating the power of the targeted harmonic by least-squares fitting. We use Yi Cao’s ‘gkde’ kernel density estimation method*1 to calculate the approximate probability density function (PDF) of powers of the harmonics, and we use the built-in function ‘ols’ in GNU Octave version 3.2.3*2 to calculate the power of the harmonic.

The criterion of the test is a critical value of 0.95, one-sided.

*1 http://www.mathworks.com/matlabcentral/fileexchange/19160
*2 http://www.gnu.org/software/octave/doc/interpreter/Linear-Least-Squares.html

The piccolo method

The ‘piccolo’ algorithm9 is an exhaustive search for the optimal combination of Fourier coefficients calculated by DFT from a given time series. The algorithm searches all possible subsets of conjugate pairs of Fourier coefficients, but the search range for the size of subsets is limited to keep the information criterion value (AIC, BIC, etc.) reliable.4

Our previously presented version of the method incorporates BIC (Bayesian Information Criterion) as the information criterion. Here we introduce AIC (Akaike’s Information Criterion) instead of BIC to improve detection performance. The previous version is called ‘piccolo/B’ in this paper; ‘piccolo’ refers to the new AIC version.

The optimal subset is defined such that the AIC value calculated from the subset and given data is minimal. AIC is used as the information criterion under the assumptions that the error distribution of the datum at each time point is normal and that its variance is the same as the variance among the time series data.4

The model is a subset of the set of the Fourier coefficients obtained by DFT from the given time series data. The number of Fourier coefficients is n when n is the number of samples in the time series; however, half of these coefficients are complex conjugates of the other half. A Fourier coefficient must always be selected with its conjugate. This allows the inverse DFT of the model to be real numbers, which is necessary to calculate the AIC. Thus, the number of model parameters is the number of the coefficient pairs. When the data length is even, the coefficient corresponding to the Nyquist frequency is a pure real number and its complex conjugate does not appear in the set of Fourier coefficients in the model. This coefficient does not form a pair when it is chosen.

The AIC value is calculated using the following equation:4

$$\mathrm{AIC} = n \log(2\pi) + n \log(\sigma^2) + n + 2p, \tag{1}$$

where n is the number of samples (data length of the time series), σ² is the variance of errors between the given time series data and the time series calculated from the model by inverse DFT, and p is the number of parameters (pairs of Fourier coefficients) in the model.

In the piccolo method, the Fourier coefficients in the subset that minimizes the AIC value are taken to be significant constituents to represent the given data. Accordingly, periods corresponding to these Fourier coefficients are considered significant, and the given time series data are judged to be periodic with periods corresponding to these Fourier coefficients. Thus, multiple periods can be found simultaneously even if their powers are close to each other.

Result

Fourteen datasets are used to compare the five methods for periodicity detection, comprising two simulated datasets and twelve DNA microarray datasets taken from an online database.

Robustness against noise

Data

We tested the robustness against noise of the five periodicity detection methods, namely the quantile method, Dixon’s Q test, Ahdesmäki’s method, the piccolo/B and the piccolo method, using simulation data consisting of one or two harmonic signals and log-normal noise.

Considering that the distribution of DNA microarray data is log-normal,19 each datum is created as a sum of a log-normally distributed random number and the value of one harmonic (the one-harmonic condition), or two harmonics (the two-harmonic condition). For both conditions, 15 datasets are generated by changing the signal-to-noise ratio as follows:

$$\mathrm{logN}(0,1) + A_i \cos\left(\frac{2\pi}{24} t + C_i\right) + B_i \cos\left(\frac{2\pi}{16} t + D_i\right)$$

where logN(0,1) is log-normally distributed random noise whose mean is 0 and variance is 1, i (i = 1, …, 500) is the suffix for time series, A_i and B_i are the amplitudes of the harmonic signals whose periods are 24 hours and 16 hours, respectively (B_i = 0 for the one-harmonic condition), and C_i and D_i are the phases of each signal. Values of A_i and B_i are determined by log-normal random numbers according to the signal-to-noise ratio (described later). Values of C_i and D_i are determined by uniformly distributed random numbers within the range of [0,48]. t is the time. Values of time are discrete and the interval is fixed at 4 hours. The number of time points is 12. Each dataset consists of 500 time series (each time series consists of twelve sampling points).

The signal-to-noise ratio (RSN), which is defined as the ratio of the variance of the signal to that of the noise, is set at various values of RSN = (0.001, 0.002, 0.005, …, 50.0) under each condition. Thus, all generated time series data in all datasets contain a circadian rhythm.

To consider whether or not Dixon’s Q test is appropriate, the normality of the distribution of the spectra and the logarithms of spectra is tested. The P-values calculated by the Kolmogorov-Smirnov test for datasets of RSN = 0.1 under both conditions are shown in Table 1. For both spectra and logarithms of spectra, the null hypothesis (the distribution is normal) cannot be rejected at the 90% confidence level. Thus, Dixon’s Q test cannot be considered inappropriate.

Detection performance

The numbers of detected time series from the datasets are compared to evaluate the robustness of the methods against noise.

In the two-harmonic condition, simulation data consist of two signals (16- and 24-hour harmonics) and noise; however, the three detection methods other than piccolo and piccolo/B can hardly detect plural signals simultaneously in principle. Therefore we tested the five methods on detection of a 24-hour signal.

Plots of the number of detected time series data on each dataset are shown in Figure 2. For both the one-harmonic and two-harmonic conditions, the piccolo method achieved a high detection rate, especially for noisy (low RSN) data.

The detection performance was relatively lower at RSN = 1.0 in the one-harmonic condition except for the piccolo method. In this dataset, the variance of the signal and noise is the same; thus, the signal and

[Figure 2: two log/log panels (above: one harmonic; below: two harmonics). x axis: RSN (S/N ratio), 0.001 to 100; y axis: # of detected on 500 data, 1 to 1000; series: piccolo, piccolo/B, Ahdesmaeki, Quantile, Q test.]

Figure 2. Log/log plots of the signal-to-noise ratio versus the number of detected time series data out of 500. The time series data consist of log-normal random noise and a harmonic (above), and log-normal random noise and two harmonics (below). Since all simulated data contain a periodic signal to be detected, the maximum possible number of detections is 500. The RSN is defined as the variance of the signal divided by the variance of the noise. Therefore, a smaller RSN value means that the simulated time series is noisier.
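The simulated series behind Figure 2 can be approximated with a short Python sketch. The excerpt does not specify how the amplitudes are drawn from the signal-to-noise ratio, so the version below makes loud assumptions: the amplitudes are fixed (not log-normal) and chosen so that the signal variance equals RSN times the variance of logN(0,1), the two harmonics share the signal variance equally, and the phases are added inside the cosine argument as in the signal equation. The names `simulate`, `NOISE_VAR` and `RSN_VALUES` are hypothetical.

```python
import math
import random

# Variance of a log-normal variable whose underlying normal is N(0, 1).
NOISE_VAR = (math.e - 1.0) * math.e

def simulate(rsn, two_harmonics=False, n_points=12, step_h=4.0, rng=random):
    """One simulated series: log-normal noise plus one or two cosines."""
    n_harm = 2 if two_harmonics else 1
    # A cosine of amplitude a has variance a^2 / 2 (assumed scaling rule).
    amp = math.sqrt(2.0 * rsn * NOISE_VAR / n_harm)
    a = amp
    b = amp if two_harmonics else 0.0
    c = rng.uniform(0.0, 48.0)  # phase of the 24 h harmonic
    d = rng.uniform(0.0, 48.0)  # phase of the 16 h harmonic
    series = []
    for i in range(n_points):
        t = i * step_h
        noise = rng.lognormvariate(0.0, 1.0)
        series.append(noise
                      + a * math.cos(2 * math.pi / 24 * t + c)
                      + b * math.cos(2 * math.pi / 16 * t + d))
    return series

# The 15 ratios 0.001, 0.002, 0.005, ..., 50.0 used for each condition.
RSN_VALUES = [m * 10 ** e for e in range(-3, 2) for m in (1, 2, 5)]

# One dataset of 500 series for the two-harmonic condition at RSN = 0.1.
generator = random.Random(0)
dataset = [simulate(0.1, two_harmonics=True, rng=generator) for _ in range(500)]
```

Passing a seeded `random.Random` instance as `rng` keeps a run reproducible; the seed value itself is arbitrary.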


noise are difficult to distinguish, especially for small-sample time series data.

The number of detected time series in the two-harmonic condition is lower than that in the one-harmonic condition for RSN > 0.01. The difference between the one-harmonic condition and the two-harmonic condition is smaller for the piccolo method than for the other methods.

Detection of circadian rhythm

Data

The five detection methods are applied to experimentally observed DNA microarray data taken from the Gene Expression Omnibus online database by NCBI, NIH,10 to detect genes (probes) which have 24-hour periodicity, or ‘circadian rhythm’.

The P-values obtained by the Kolmogorov-Smirnov test for the normality of the distribution of powers and logarithms of powers are shown in Table 2. P-values are calculated for each time series in the datasets, and the means and standard deviations of the P-values are listed in the table. For both powers and logarithms of powers, the null hypothesis (the distribution is normal) cannot be rejected at the 95% confidence level. Although the sample sizes are small (6 to 12), it can be said that Dixon’s Q test cannot be considered inappropriate.

It is not defined whether or not the time series in the datasets are circadian; however, some of them are labeled with the GO term20 ‘circadian rhythm’. Here, detection performance is evaluated in terms of the total number of detected probes and the number of detected probes labeled ‘circadian rhythm’ for each dataset. The quantile method cannot be used on datasets in which the data length of each time series is 7 or less.

Biological description of datasets

All twelve DNA microarray datasets are time series observations intended for the analysis of circadian rhythm. GDS1629 is a set of forty-five samples of an immortalized suprachiasmatic nucleus cell line of normal rat for 42 hours, every 6 hours (eight time points). The dataset contains five or six samples for each time point. We only use one of them, whose sample ID is the largest. GDS2110 is a set of six samples of normal Macaca mulatta adult female adrenal glands for 20 hours, every 4 hours (six time points). GDS2232 is a set of twenty-four samples of normal mouse adrenal glands for 44 hours, every 4 hours (twelve time points). The dataset contains two samples for each time point. We only use one of them, which appears earlier in the published data file. GDS404 is a set of thirteen samples of normal mouse aortae for 44 hours, every 4 hours (twelve time points). The dataset contains two samples for the first time point. We only use one of them, which appears earlier in the published data file. GSE3424 is a set of eight samples of normal Arabidopsis thaliana for 20 hours, every 4 hours (six time points). The dataset contains two samples for two time points (0-hour and 12-hour). We only use one of them, which appears earlier in the published data file. GSE6542 is a set of forty-eight samples of three mutants of Drosophila melanogaster in two experimental conditions (seven conditions in total). We divide it into seven sub-datasets here. Six sub-datasets consist of six time points and one consists of twelve points. All these datasets are normalized by publishers for further analysis.

Data of duplicate probes for the same gene and data of probes which contain a numerically invalid value are ignored for this performance comparison.

Detection performance

The detection results are shown in Table 3. For both the total number of detected probes and the number of detected probes labeled circadian, the piccolo method is superior to the other four methods, including the previous version of the piccolo (piccolo/B), for all datasets. The ratio of S in Table 3, which is the number of probes detected by the piccolo method but not by the other four methods, to the total number of probes in each dataset ranges from 0.333 (GDS1629) to 0.776 (GSE6542_3). This means that using the piccolo method we find that 77.6% of all probes in GSE6542_3 are under the influence of circadian oscillation mechanisms, but the other four methods cannot detect these probes.

On the other hand, the ratio of the number of probes detected by one or more of the other four methods but not by the piccolo method to the total number of probes in each dataset is in the range of 0.0 (GSE6542_2, GSE6542_4, GSE6542_6) to 0.0418 (GDS404), or less than 5% (data not shown).


Table 3. Results of detection of circadian oscillation on the twelve DNA microarray datasets. Numbers before and after a slash are the number of detected probes and the number of detected circadian-annotated probes, respectively. The annotated probes are labeled with the GO term ‘circadian rhythm’ in the chip definition files of the microarrays.

Dataset     C   Quantile   Q test     Ahdesmäki  Piccolo/B  Piccolo      S
GDS1629     22  146 / 1    60 / 0     121 / 0    163 / 1    2231 / 7     1981
GDS2110     26  –          667 / 0    457 / 2    0 / 0      10658 / 16   9745
GDS2232     37  4005 / 1   3053 / 3   5343 / 4   9118 / 11  23057 / 28   10892
GSE3424     33  –          2853 / 1   1436 / 2   0 / 0      19233 / 29   15837
GDS404      12  714 / 2    497 / 2    655 / 3    752 / 3    4044 / 6     2670
GSE6542-1   28  –          529 / 0    401 / 1    0 / 0      8554 / 23    7829
GSE6542-2   28  –          413 / 0    339 / 2    0 / 0      7706 / 23    7118
GSE6542-3   28  –          720 / 0    340 / 1    0 / 0      9939 / 23    9099
GSE6542-4   28  –          495 / 0    361 / 0    0 / 0      8924 / 23    8235
GSE6542-5   28  799 / 3    623 / 0    656 / 5    1046 / 4   6038 / 19    4335
GSE6542-6   28  –          534 / 0    378 / 1    0 / 0      8513 / 20    7800
GSE6542-7   28  –          501 / 0    403 / 0    0 / 0      8236 / 19    7517

Notes: C is the number of circadian probes in the chip used for each dataset (duplicate probes for each gene and probes containing invalid numerical data are omitted). S is the number of probes detected only by the piccolo method and not by the other four methods.

[Figure 3: twelve time series panels, one per dataset: GDS1629, GDS2110, GDS2232, GDS404, GSE3424, GSE6542_1, GSE6542_2, GSE6542_3, GSE6542_4, GSE6542_5, GSE6542_6, GSE6542_7.]

Figure 3. Plots of time series data which are detected only by the piccolo method and not by the other four methods. For each dataset, the time series of the probe with the largest ratio between the maximum power and the second largest power is plotted. The ranges of sampling time points differ between datasets. The datasets and their time ranges are: Top (left to right): GDS1629 (44 h), GDS2110 (20 h), GDS2232 (44 h); Second (left to right): GDS404 (44 h), GSE3424 (20 h), GSE6542_1 (20 h); Third (left to right): GSE6542_2 (20 h), GSE6542_3 (20 h), GSE6542_4 (20 h); Bottom (left to right): GSE6542_5 (44 h), GSE6542_6 (20 h), GSE6542_7 (20 h).
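The subset search that underlies the piccolo results can be illustrated with a minimal Python sketch of Equation (1). It is a sketch under stated assumptions, not the author's MATLAB/GNU Octave script: the series is centered before the DFT, the residual variance is obtained through Parseval's relation instead of an explicit inverse DFT, and the cap on subset size (`max_pairs`) is a hypothetical parameter, since the excerpt only states that the subset size is limited. The names `dft` and `aic_subset_search` and the example series are illustrative.

```python
import cmath
import math
from itertools import combinations

def dft(x):
    """Naive, unnormalized discrete Fourier transform (O(n^2))."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def aic_subset_search(x, step_h, max_pairs=3):
    """Return (minimum AIC, periods in hours of the selected harmonics)."""
    n = len(x)
    mean = sum(x) / n
    coeffs = dft([v - mean for v in x])  # centering is an assumption
    # Energy of each selectable unit: a conjugate pair (counted twice) or,
    # for even n, the lone real coefficient at the Nyquist frequency.
    energy = {}
    for k in range(1, n // 2 + 1):
        weight = 1 if (n % 2 == 0 and k == n // 2) else 2
        energy[k] = weight * abs(coeffs[k]) ** 2
    total = sum(energy.values())
    best_aic, best_subset = math.inf, ()
    for p in range(max_pairs + 1):
        for subset in combinations(energy, p):
            # Parseval: error variance left after keeping the chosen harmonics.
            sigma2 = (total - sum(energy[k] for k in subset)) / n ** 2
            if sigma2 <= 0:
                continue  # perfect fit, log(sigma2) undefined
            aic = n * math.log(2 * math.pi) + n * math.log(sigma2) + n + 2 * p
            if aic < best_aic:
                best_aic, best_subset = aic, subset
    return best_aic, [n * step_h / k for k in best_subset]

# Hypothetical series: 12 points every 4 h, 24 h and 16 h cosines plus a
# small deterministic disturbance standing in for noise.
x = [3 * math.cos(2 * math.pi * 4 * t / 24)
     + 2 * math.cos(2 * math.pi * 4 * t / 16 + 1.0)
     + 0.1 * ((7 * t) % 5 - 2)
     for t in range(12)]
aic, periods = aic_subset_search(x, step_h=4.0, max_pairs=2)
```

With at most two units allowed, the search on this example keeps the 24-hour and 16-hour harmonics together, which is the behavior the text describes as finding multiple periods simultaneously. The number of candidate subsets grows combinatorially with data length, so the cap on subset size also bounds the running time.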


[Figure 4: plot of computational time [s] (y axis, logarithmic scale, 0.1 to 10000) against length of data (x axis, 5 to 40) for four methods: piccolo, Ahdesmäki, 1.5 IQR, and Q test.]

Figure 4. Computational time needed to perform detection on 500 time series. The x axis is the length of the time series (the number of time points). The y axis is the elapsed CPU time in seconds, on a logarithmic scale. The 500 time series are generated from normally distributed random numbers. The CPU time of the piccolo/B method (the previous version of the piccolo method, not shown here) is very similar to that of the piccolo method, which incorporates the AIC.

The time series of a probe detected only by the piccolo method is plotted in each panel of Figure 3 (one probe is chosen for each dataset).

Computational cost
We measured the increase in computational time required to perform detection on 500 time series when the data length of each time series is increased from 6 to 40. The dataset consists of normally distributed random numbers with a mean of 0 and a variance of 1. The results are shown in Figure 4. In the performance evaluation, all detection programs were run on GNU Octave version 3.2.3 (*3) on Mac OS X 10.6.3, on a computer equipped with two 3 GHz Dual-Core Intel Xeon processors and 8 GB of 667-MHz DDR2 memory.

*3 http://octave.sourceforge.net/

The computational time of the quantile method, Dixon's Q test and Ahdesmäki's method increases linearly with increasing data length. For the piccolo method, the increase is exponential. The CPU time of piccolo/B is almost the same as that of piccolo and is not shown here.
The curves fitted to the data, ax + b for Ahdesmäki's method and exp(ax + b) for the piccolo method, intersect at x = 18.8 (x is the data length). The piccolo method is therefore faster than Ahdesmäki's method for small datasets with a data length of less than 19.

Discussion
Five methods for periodicity detection, namely, two simple methods (the quantile method and Dixon's Q test), one recently proposed method (Ahdesmäki's method) and two methods by the authors (piccolo and piccolo/B), were compared on short (small-sample) time series: two simulated datasets with twelve time points each, and twelve experimentally observed DNA microarray datasets with 6, 8, or 12 time points that record the circadian rhythm.
Dixon's Q test requires the assumption that the distribution of samples is normal. P-values for the normality of the distribution of the spectra and of the logarithm of the spectra of each time series in the given datasets were calculated by the Kolmogorov-Smirnov test. The null hypothesis (the distribution is normal) was not rejected for the logarithms of the spectra of any dataset.
The piccolo method selects significant harmonics to model the data. Harmonics included in the best model, that is, the model that minimizes the AIC, are regarded as significant. A harmonic whose power is not the maximum is detected as significant more frequently by the piccolo method than by the other, outlier-based methods. These lower-power harmonics are selected according to the AIC and are therefore considered to be statistically significant. The high detection sensitivity of the piccolo method is shown by the results of analyses using both simulated and experimentally observed data. These results are consistent with the expectations that most genes in a living cell are involved in one or more gene regulatory networks and that these networks are interconnected. The oscillation of the core circadian clock genes is expected to spread over whole gene networks.
S in Table 3 shows that many genes exhibiting periodicity in the form of circadian rhythm can be detected only by the piccolo method and not by the other four methods. This finding can be attributed to the magnitude of circadian periodicity, which is thought to depend on the 'distance' from the central circadian clock systems within the whole interconnected gene regulatory network. Many genes further from the central clock systems could have lower-magnitude circadian periodicity and cannot be detected by methods other than piccolo.
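The harmonic selection described in the Discussion can be illustrated with a minimal Python sketch. It is not the author's Octave implementation: the function name `select_harmonics_by_aic` is hypothetical, the harmonics are fitted by ordinary least squares rather than taken from DFT coefficients, and the AIC is computed with the common Gaussian-residual form n·log(RSS/n) + 2p, which is an assumption here. The sketch searches every subset of harmonics exhaustively, so the number of candidate models doubles with each additional harmonic, which is in line with the exponential growth in computational time reported for the piccolo method.

```python
from itertools import combinations

import numpy as np


def select_harmonics_by_aic(y):
    """Return (harmonics, aic) of the AIC-minimizing harmonic model.

    Illustrative assumption: equally spaced samples, a constant term plus
    cos/sin pairs for each chosen harmonic, Gaussian residuals.
    """
    y = np.asarray(y, dtype=float)
    n = y.size
    t = np.arange(n)
    max_k = (n - 1) // 2  # harmonics strictly below the Nyquist frequency

    best_aic, best_subset = np.inf, ()
    # Exhaustive search: 2**max_k candidate subsets of harmonics.
    for r in range(max_k + 1):
        for subset in combinations(range(1, max_k + 1), r):
            cols = [np.ones(n)]  # constant (mean) term
            for k in subset:
                w = 2.0 * np.pi * k * t / n
                cols += [np.cos(w), np.sin(w)]
            X = np.column_stack(cols)
            beta = np.linalg.lstsq(X, y, rcond=None)[0]
            # Floor the residual sum of squares to avoid log(0) on exact fits.
            rss = max(float(np.sum((y - X @ beta) ** 2)), 1e-12)
            n_params = X.shape[1] + 1  # coefficients plus noise variance
            aic = n * np.log(rss / n) + 2 * n_params
            if aic < best_aic:
                best_aic, best_subset = aic, subset
    return best_subset, best_aic
```

Under these assumptions, a series would be judged periodic when the returned tuple of harmonics is non-empty. A lower-power harmonic is kept whenever including it lowers the AIC, even though it is not the largest spectral peak.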




A comparison of the five methods using simulation data shows that the piccolo method is the most robust against noise. The detection performance of all methods except the piccolo method was worse for the two-harmonic data than for the one-harmonic data. The piccolo method exhibited more consistent performance between datasets than the other methods. This suggests that the piccolo method has high detection performance for data with multiple periodicities.
The computational cost of the piccolo method represents a potential problem for large datasets. In future work, we will attempt to reduce the computational cost by introducing the branch and bound method into the exhaustive search for the combination of Fourier coefficients.

Acknowledgement
We wish to thank Drs. Wataru Fujibuchi and Sachiyo Aburatani of the CBRC, AIST, for fruitful discussions.

Disclosure
This manuscript has been read and approved by the author. This paper is unique and is not under consideration by any other publication and has not been published elsewhere. The author and peer reviewers of this paper report no conflicts of interest. The author confirms that they have permission to reproduce any copyrighted material.

References
1. Ernst J, Bar-Joseph Z. STEM: a tool for the analysis of short time series gene expression data. BMC Bioinformatics. 2006;7:191.
2. McQuarrie ADR, Tsai CL. Regression and Time Series Model Selection. World Scientific; 1998.
3. Artis M, Hoffmann M, Nachane D, Toro J. The detection of hidden periodicities: A comparison of alternative methods. EUI Working Paper ECO. 2004;10.
4. Sakamoto Y, Ishiguro K, Kitagawa G. Akaike Information Criterion Statistics. Springer-Verlag; 1986.
5. Benedetto JJ, Pfander GE. Periodic wavelet transforms and periodicity detection. SIAM Journal of Applied Mathematics. 2002;62(4):1329–68.
6. Janer L, Bonet JB, Lleida-Solano E. Pitch detection and voiced/unvoiced decision algorithm based on wavelet transform. Proceedings of The Fourth International Conference on Spoken Language Processing. 1996;2(FrP2P1):1209–12.
7. Okamura H, Semba Y. A novel statistical method for validating the periodicity of vertebral growth band formation in elasmobranch fishes. Canadian Journal of Fisheries and Aquatic Sciences. 2009;66(5):771–80.
8. Yang R, Su Z. Analyzing circadian expression data by harmonic regression based on autoregressive spectral estimation. Bioinformatics. 2010;26:i168–74.
9. Tominaga D, Horimoto K. Judgment algorithm for periodicity of time series data based on bayesian information criterion. Journal of Bioinformatics and Computational Biology. 2008;6(4):747–57.
10. Barrett T, Suzek TO, Troup DB, et al. NCBI GEO: mining millions of expression profiles-database and tools. Nucleic Acids Research. 2005;33:D562–6.
11. Parkinson H, Kapushesky M, Kolesnikov N, et al. ArrayExpress update-from an archive of functional genomics experiments to the atlas of gene expression. Nucleic Acids Research. 2009;37:D868–72.
12. Ahdesmäki M, Lähdesmäki H, Pearson R, et al. Robust detection of periodic time series measured from biological systems. BMC Bioinformatics. 2005;6:117.
13. Hogg RV, McKean JW, Craig AT. Introduction to Mathematical Statistics. 6th ed. Pearson Prentice Hall; 2005.
14. Rousseeuw PJ, Leroy AM. Robust Regression and Outlier Detection. Wiley-Interscience; 2003.
15. Dixon WJ. Analysis of extreme values. Annals of Mathematical Statistics. 1950;21:488–506.
16. Dixon WJ. Ratios involving extreme values. Annals of Mathematical Statistics. 1951;22:68–78.
17. Rorabacher DB. Statistical treatment for rejection of deviant values: critical values of Dixon's "Q" parameter and related subrange ratios at the 95% confidence level. Analytical Chemistry. 1991;63(2):139–46.
18. Silverman BW. Density Estimation for Statistics and Data Analysis. Chapman and Hall/CRC; 1986.
19. Konishi T. Three-parameter lognormal distribution ubiquitously found in cDNA microarray data and its application to parametric data treatment. BMC Bioinformatics. 2004;5:5.
20. The Gene Ontology Consortium. Gene ontology: tool for the unification of biology. Nature Genetics. 2000;25(1):25–9.
