Time series
In data sources such as usage logs, sensor measurements, and financial instruments, the presence of a time stamp results in an implicit temporal ordering on the observations.
In these applications, it becomes important to be able to treat the time stamp as an index around which several important operations can be performed, such as:
- grouping the data with respect to various intervals of time
- aggregating data across time intervals
- aggregating/imputing raw data into regular discrete intervals
The TimeSeries object is the fundamental data structure for multivariate time series data. TimeSeries objects are backed by a single SFrame, but include extra metadata.
$$T$$ | $$V_1$$ | $$V_2$$ | $$...$$ | $$V_k$$ |
---|---|---|---|---|
$$t_{1}$$ | $$v_{11}$$ | $$v_{21}$$ | $$...$$ | $$v_{k1}$$ |
$$t_{2}$$ | $$v_{12}$$ | $$v_{22}$$ | $$...$$ | $$v_{k2}$$ |
$$t_{3}$$ | $$v_{13}$$ | $$v_{23}$$ | $$...$$ | $$v_{k3}$$ |
$$...$$ | $$...$$ | $$...$$ | $$...$$ | $$...$$ |
$$...$$ | $$...$$ | $$...$$ | $$...$$ | $$...$$ |
$$t_{n}$$ | $$v_{1n}$$ | $$v_{2n}$$ | $$...$$ | $$v_{kn}$$ |
Each column pair $$(V_i, T)$$ in the table corresponds to a univariate time series. $$V_i$$ is the value column of the $$i$$-th univariate time series, and $$T$$ is the index column that is shared among all the univariate time series.
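The layout above can be pictured in plain Python as one shared index list plus a dictionary of value columns. This is only an illustrative sketch of the idea, not how TimeSeries is implemented; the names index, values, and univariate are hypothetical.

```python
import datetime as dt

# A toy multivariate time series: one shared index column T and k = 2 value columns.
# The column names V_1 and V_2 follow the table above; the data is illustrative.
index = [
    dt.datetime(2006, 12, 16, 17, 24),
    dt.datetime(2006, 12, 16, 17, 26),
    dt.datetime(2006, 12, 16, 17, 28),
]
values = {
    "V_1": [4.216, 5.374, 3.666],
    "V_2": [0.418, 0.498, 0.528],
}

def univariate(name):
    """Pair the shared index T with one value column V_i (hypothetical helper)."""
    return list(zip(index, values[name]))

series_v1 = univariate("V_1")
```

Every univariate series obtained this way reuses the same index list, which is what sharing the index column means here.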
In this chapter, we will use a dataset obtained from the UCI machine learning repository. The dataset contains measurements of electric power consumption in one household with a one-minute sampling rate over a period of almost 4 years. The entire dataset contains 2,075,259 measurements gathered between December 2006 and November 2010 (47 months). A sample of the dataset is stored as an SFrame that can be loaded as follows:
import graphlab as gl
household_data = gl.SFrame("http://s3.amazonaws.com/dato-datasets/household_electric_sample.sf")
Data:
+---------------------+-----------------------+---------+---------------------+
| Global_active_power | Global_reactive_power | Voltage | DateTime |
+---------------------+-----------------------+---------+---------------------+
| 4.216 | 0.418 | 234.84 | 2006-12-16 17:24:00 |
| 5.374 | 0.498 | 233.29 | 2006-12-16 17:26:00 |
| 3.666 | 0.528 | 235.68 | 2006-12-16 17:28:00 |
| 3.52 | 0.522 | 235.02 | 2006-12-16 17:29:00 |
| 3.7 | 0.52 | 235.22 | 2006-12-16 17:31:00 |
| 3.668 | 0.51 | 233.99 | 2006-12-16 17:32:00 |
| 3.27 | 0.152 | 236.73 | 2006-12-16 17:40:00 |
| 3.728 | 0.0 | 235.84 | 2006-12-16 17:43:00 |
| 5.894 | 0.0 | 232.69 | 2006-12-16 17:44:00 |
| 7.026 | 0.0 | 232.21 | 2006-12-16 17:46:00 |
+---------------------+-----------------------+---------+---------------------+
[1025260 rows x 4 columns]
Time series construction
We construct a TimeSeries object from the SFrame household_data by specifying the DateTime column as the index column. The data is sorted by the DateTime column when indexed into a time series.
household_ts = gl.TimeSeries(household_data, index="DateTime")
The index column of the TimeSeries is: DateTime
+---------------------+---------------------+-----------------------+---------+
| DateTime | Global_active_power | Global_reactive_power | Voltage |
+---------------------+---------------------+-----------------------+---------+
| 2006-12-16 17:24:00 | 4.216 | 0.418 | 234.84 |
| 2006-12-16 17:26:00 | 5.374 | 0.498 | 233.29 |
| 2006-12-16 17:28:00 | 3.666 | 0.528 | 235.68 |
| 2006-12-16 17:29:00 | 3.52 | 0.522 | 235.02 |
| 2006-12-16 17:31:00 | 3.7 | 0.52 | 235.22 |
| 2006-12-16 17:32:00 | 3.668 | 0.51 | 233.99 |
| 2006-12-16 17:40:00 | 3.27 | 0.152 | 236.73 |
| 2006-12-16 17:43:00 | 3.728 | 0.0 | 235.84 |
| 2006-12-16 17:44:00 | 5.894 | 0.0 | 232.69 |
| 2006-12-16 17:46:00 | 7.026 | 0.0 | 232.21 |
+---------------------+---------------------+-----------------------+---------+
[1025260 rows x 4 columns]
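The effect of indexing, namely that rows end up ordered by the index column, can be sketched in plain Python. This is an illustration under the assumption that rows are simple dictionaries; index_by is a hypothetical helper and not part of GraphLab Create.

```python
import datetime as dt

# Rows arrive in arbitrary order; each row is a dict with a "DateTime" key.
rows = [
    {"DateTime": dt.datetime(2006, 12, 16, 17, 28), "Voltage": 235.68},
    {"DateTime": dt.datetime(2006, 12, 16, 17, 24), "Voltage": 234.84},
    {"DateTime": dt.datetime(2006, 12, 16, 17, 26), "Voltage": 233.29},
]

def index_by(rows, index):
    """Return the rows sorted by the chosen index column (hypothetical helper)."""
    return sorted(rows, key=lambda row: row[index])

indexed = index_by(rows, "DateTime")
```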
The following figure illustrates the multivariate time series. The index column DateTime is the x-axis, and the columns Global_active_power, Global_reactive_power, and Voltage are plotted on the y-axis.
Now the dataset is indexed by the column DateTime, and all future operations involving time are optimized. At any point, the time series can be converted to an SFrame using the to_sframe function at zero cost.
sf = household_ts.to_sframe()
Note that each column in the TimeSeries object is an SArray. A subset of columns can be selected as follows:
ts_power = household_ts[['Global_active_power', 'Global_reactive_power']]
The following figure illustrates the time series ts_power.
Resampling
In many practical time series analysis problems, we require observations to be over uniform time intervals. However, data is often in the form of non-uniform events with accompanying time stamps. As a result, one common prerequisite for time series applications is to convert a time series that is potentially irregularly sampled to one that is sampled at a regular frequency (or to a frequency different from that of the input data source).
There are three important primitive operations required for this purpose:
- Mapping – The operation which determines which time slice a specific observation belongs to.
- Interpolation/Upsampling – The operation used to fill in the missing values when there are no observations that map to a particular time slice.
- Aggregation/Downsampling – The operation used to aggregate multiple observations that belong to the same time slice.
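The mapping primitive can be sketched as rounding a time stamp down to the start of its slice. The snippet below is a minimal plain-Python illustration; the function name time_slice and the choice of 1970-01-01 as the alignment origin are assumptions, and the library may align slices differently.

```python
import datetime as dt

def time_slice(timestamp, period, origin=dt.datetime(1970, 1, 1)):
    """Return the start of the slice of length `period` that contains `timestamp`.

    Slices are aligned to `origin` (an assumption made for this sketch).
    """
    # Number of whole periods elapsed between the origin and the time stamp.
    n_periods = int((timestamp - origin).total_seconds() // period.total_seconds())
    return origin + n_periods * period

stamp = dt.datetime(2006, 12, 16, 17, 24)
hourly_slice = time_slice(stamp, dt.timedelta(hours=1))
daily_slice = time_slice(stamp, dt.timedelta(days=1))
```

Observations that map to the same slice start are candidates for aggregation, while slice starts that no observation maps to are the ones that need interpolation.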
As an example, we resample household_ts into a time series at a daily granularity.
import datetime as dt
day = dt.timedelta(days = 1)
daily_ts = household_ts.resample(day, downsample_method='max', upsample_method=None)
+---------------------+---------------------+-----------------------+---------+
| DateTime | Global_active_power | Global_reactive_power | Voltage |
+---------------------+---------------------+-----------------------+---------+
| 2006-12-16 00:00:00 | 7.026 | 0.528 | 243.73 |
| 2006-12-17 00:00:00 | 6.58 | 0.582 | 249.07 |
| 2006-12-18 00:00:00 | 5.436 | 0.646 | 248.48 |
| 2006-12-19 00:00:00 | 7.84 | 0.606 | 248.89 |
| 2006-12-20 00:00:00 | 5.988 | 0.482 | 249.48 |
| 2006-12-21 00:00:00 | 5.614 | 0.688 | 247.08 |
| 2006-12-22 00:00:00 | 7.884 | 0.622 | 248.82 |
| 2006-12-23 00:00:00 | 8.698 | 0.724 | 246.77 |
| 2006-12-24 00:00:00 | 6.498 | 0.494 | 249.27 |
| 2006-12-25 00:00:00 | 6.702 | 0.7 | 250.62 |
+---------------------+---------------------+-----------------------+---------+
[1442 rows x 4 columns]
The following figure illustrates the resampled time series daily_ts.
In this example, the mapping is performed by choosing intervals of length 1 day. The downsampling method returns the maximum value (for each column) of all the data points in the original time series that fall within an interval. The upsampling method sets a None value (for a column) for an interval in the returned time series if there are no values (for that column) within that time interval in the original time series.
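To make the combination of max downsampling and None upsampling concrete, here is a plain-Python sketch for a single value column. It is not the library's implementation; resample_max is a hypothetical function, slices are assumed to be aligned to midnight, and the input is assumed to be non-empty.

```python
import datetime as dt

def resample_max(observations, period):
    """Downsample (timestamp, value) pairs to one value per slice using max.

    Slices without any observation get None, mirroring upsample_method=None.
    """
    origin = dt.datetime(1970, 1, 1)  # assumed alignment origin

    def slice_start(ts):
        n = int((ts - origin).total_seconds() // period.total_seconds())
        return origin + n * period

    # Mapping and aggregation: keep the maximum value seen in each slice.
    buckets = {}
    for ts, value in observations:
        key = slice_start(ts)
        buckets[key] = value if key not in buckets else max(buckets[key], value)

    # Emit every slice from the first to the last, filling gaps with None.
    result = []
    current, last = min(buckets), max(buckets)
    while current <= last:
        result.append((current, buckets.get(current)))
        current += period
    return result

obs = [
    (dt.datetime(2006, 12, 16, 17, 24), 4.216),
    (dt.datetime(2006, 12, 16, 17, 46), 7.026),
    (dt.datetime(2006, 12, 18, 0, 3), 0.206),
]
daily = resample_max(obs, dt.timedelta(days=1))
```

In this toy input there is no observation on 2006-12-17, so that day appears in the result with a None value.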
Shifting time series data
Time series data can also be shifted along the time dimension using the TimeSeries.shift and TimeSeries.tshift methods.
The tshift operator shifts the index column of the time series along the time dimension while keeping other columns intact. For example, we can shift household_ts by an hour, so that the time stamps of all the tuples move forward by one hour:
interval = dt.timedelta(hours = 1)
shifted_ts = household_ts.tshift(interval)
+---------------------+---------------------+-----------------------+---------+
| DateTime | Global_active_power | Global_reactive_power | Voltage |
+---------------------+---------------------+-----------------------+---------+
| 2006-12-16 18:24:00 | 4.216 | 0.418 | 234.84 |
| 2006-12-16 18:26:00 | 5.374 | 0.498 | 233.29 |
| 2006-12-16 18:28:00 | 3.666 | 0.528 | 235.68 |
| 2006-12-16 18:29:00 | 3.52 | 0.522 | 235.02 |
| 2006-12-16 18:31:00 | 3.7 | 0.52 | 235.22 |
| 2006-12-16 18:32:00 | 3.668 | 0.51 | 233.99 |
| 2006-12-16 18:40:00 | 3.27 | 0.152 | 236.73 |
| 2006-12-16 18:43:00 | 3.728 | 0.0 | 235.84 |
| 2006-12-16 18:44:00 | 5.894 | 0.0 | 232.69 |
| 2006-12-16 18:46:00 | 7.026 | 0.0 | 232.21 |
+---------------------+---------------------+-----------------------+---------+
[1025260 rows x 4 columns]
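Conceptually, a time shift only rewrites the index. A minimal plain-Python sketch over (timestamp, value) pairs follows; the function name shift_index is hypothetical and stands in for the idea behind tshift, not for its implementation.

```python
import datetime as dt

def shift_index(series, delta):
    """Add `delta` to every time stamp while leaving the values untouched."""
    return [(ts + delta, value) for ts, value in series]

series = [
    (dt.datetime(2006, 12, 16, 17, 24), 4.216),
    (dt.datetime(2006, 12, 16, 17, 26), 5.374),
]
shifted = shift_index(series, dt.timedelta(hours=1))
```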
The shift operator shifts all the value columns forward or backward while keeping the index column intact. Notice that this operator does not change the range of the TimeSeries object, and it fills the edge tuples that lost their values with None.
shifted_ts = household_ts.shift(steps = 3)
+---------------------+---------------------+-----------------------+---------+
| DateTime | Global_active_power | Global_reactive_power | Voltage |
+---------------------+---------------------+-----------------------+---------+
| 2006-12-16 17:24:00 | None | None | None |
| 2006-12-16 17:26:00 | None | None | None |
| 2006-12-16 17:28:00 | None | None | None |
| 2006-12-16 17:29:00 | 4.216 | 0.418 | 234.84 |
| 2006-12-16 17:31:00 | 5.374 | 0.498 | 233.29 |
| 2006-12-16 17:32:00 | 3.666 | 0.528 | 235.68 |
| 2006-12-16 17:40:00 | 3.52 | 0.522 | 235.02 |
| 2006-12-16 17:43:00 | 3.7 | 0.52 | 235.22 |
| 2006-12-16 17:44:00 | 3.668 | 0.51 | 233.99 |
| 2006-12-16 17:46:00 | 3.27 | 0.152 | 236.73 |
+---------------------+---------------------+-----------------------+---------+
[1025260 rows x 4 columns]
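The value shift can be sketched for one column as padding with None on one side and dropping values on the other. This is a plain-Python illustration; shift_values is a hypothetical name, and treating negative steps as a backward shift is an assumption of this sketch.

```python
def shift_values(values, steps):
    """Shift a list of values by `steps` rows, padding vacated rows with None.

    Positive steps move values toward later rows; negative steps (assumed
    here to mean a backward shift) move them toward earlier rows.
    """
    n = len(values)
    if steps >= 0:
        steps = min(steps, n)
        return [None] * steps + values[:n - steps]
    steps = min(-steps, n)
    return values[steps:] + [None] * steps

voltage = [234.84, 233.29, 235.68, 235.02, 235.22]
forward = shift_values(voltage, 3)
backward = shift_values(voltage, -2)
```

The result always has the same length as the input, which is why the range of the time series does not change.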
Index Join
Another important feature of TimeSeries objects in GraphLab Create is the ability to efficiently join them across the index column. Here we join the TimeSeries object household_ts with a second TimeSeries object, ts_other, which is loaded from a separate SFrame.
sf_other = gl.SFrame('http://s3.amazonaws.com/dato-datasets/household_electric_sample_2.sf')
ts_other = gl.TimeSeries(sf_other, index = 'DateTime')
household_ts.index_join(ts_other, how='inner')
+---------------------+---------------------+-----------------------+---------+
| DateTime | Global_active_power | Global_reactive_power | Voltage |
+---------------------+---------------------+-----------------------+---------+
| 2006-12-16 17:24:00 | 4.216 | 0.418 | 234.84 |
| 2006-12-16 17:26:00 | 5.374 | 0.498 | 233.29 |
| 2006-12-16 17:28:00 | 3.666 | 0.528 | 235.68 |
| 2006-12-16 17:29:00 | 3.52 | 0.522 | 235.02 |
| 2006-12-16 17:31:00 | 3.7 | 0.52 | 235.22 |
| 2006-12-16 17:32:00 | 3.668 | 0.51 | 233.99 |
| 2006-12-16 17:40:00 | 3.27 | 0.152 | 236.73 |
| 2006-12-16 17:43:00 | 3.728 | 0.0 | 235.84 |
| 2006-12-16 17:44:00 | 5.894 | 0.0 | 232.69 |
| 2006-12-16 17:46:00 | 7.026 | 0.0 | 232.21 |
+---------------------+---------------------+-----------------------+---------+
+------------------+
| Global_intensity |
+------------------+
| 18.4 |
| 23.0 |
| 15.8 |
| 15.0 |
| 15.8 |
| 15.8 |
| 13.8 |
| 16.4 |
| 25.4 |
| 30.6 |
+------------------+
[1025260 rows x 5 columns]
The how parameter of the index_join operator determines the join method. The acceptable values are 'inner', 'left', 'right', and 'outer'. The behavior is exactly like that of the SFrame join methods.
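The four join methods differ only in which index values survive. The sketch below shows this in plain Python with each series stored as a dictionary from time stamp to a row dictionary. It is not the library's implementation; join_on_index is a hypothetical helper, and the two inputs are assumed to have disjoint column names.

```python
import datetime as dt

def join_on_index(left, right, how="inner"):
    """Join two {timestamp: row_dict} mappings on their time stamps.

    Columns missing on one side are filled with None.
    """
    left_cols = sorted({c for row in left.values() for c in row})
    right_cols = sorted({c for row in right.values() for c in row})
    if how == "inner":
        keys = set(left) & set(right)
    elif how == "left":
        keys = set(left)
    elif how == "right":
        keys = set(right)
    elif how == "outer":
        keys = set(left) | set(right)
    else:
        raise ValueError("how must be 'inner', 'left', 'right', or 'outer'")
    joined = {}
    for key in sorted(keys):
        row = {c: left.get(key, {}).get(c) for c in left_cols}
        row.update({c: right.get(key, {}).get(c) for c in right_cols})
        joined[key] = row
    return joined

t1 = dt.datetime(2006, 12, 16, 17, 24)
t2 = dt.datetime(2006, 12, 16, 17, 26)
t3 = dt.datetime(2006, 12, 16, 17, 28)
left = {t1: {"Voltage": 234.84}, t2: {"Voltage": 233.29}}
right = {t2: {"Global_intensity": 23.0}, t3: {"Global_intensity": 15.8}}

inner = join_on_index(left, right, how="inner")
outer = join_on_index(left, right, how="outer")
```

With these toy inputs, the inner join keeps only the time stamp present on both sides, while the outer join keeps all three and fills the gaps with None.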
Time series slicing
The range of a time series is defined as the interval (start, end) of the time stamps that span the time series. It can be obtained as follows:
start_time, end_time = household_ts.range
(datetime.datetime(2006, 12, 16, 17, 24), datetime.datetime(2007, 11, 26, 20, 57))
We can obtain a slice of a time series that lies within its range using the TimeSeries.slice operator.
import datetime as dt
start = dt.datetime(2006, 12, 16, 17, 24)
end = dt.datetime(2007, 11, 26, 21, 2)
sliced_ts = household_ts.slice(start, end)
+---------------------+---------------------+-----------------------+---------+
| DateTime | Global_active_power | Global_reactive_power | Voltage |
+---------------------+---------------------+-----------------------+---------+
| 2006-12-16 17:24:00 | 4.216 | 0.418 | 234.84 |
| 2006-12-16 17:26:00 | 5.374 | 0.498 | 233.29 |
| 2006-12-16 17:28:00 | 3.666 | 0.528 | 235.68 |
| 2006-12-16 17:29:00 | 3.52 | 0.522 | 235.02 |
| 2006-12-16 17:31:00 | 3.7 | 0.52 | 235.22 |
| 2006-12-16 17:32:00 | 3.668 | 0.51 | 233.99 |
| 2006-12-16 17:40:00 | 3.27 | 0.152 | 236.73 |
| 2006-12-16 17:43:00 | 3.728 | 0.0 | 235.84 |
| 2006-12-16 17:44:00 | 5.894 | 0.0 | 232.69 |
| 2006-12-16 17:46:00 | 7.026 | 0.0 | 232.21 |
+---------------------+---------------------+-----------------------+---------+
[246363 rows x 4 columns]
We can also slice the data for a particular year as follows:
start = dt.datetime(2010,1,1)
end = dt.datetime(2011,1,1)
ts_2010 = household_ts.slice(start, end)
+---------------------+---------------------+-----------------------+---------+
| DateTime | Global_active_power | Global_reactive_power | Voltage |
+---------------------+---------------------+-----------------------+---------+
| 2010-01-01 00:00:00 | 1.79 | 0.236 | 240.65 |
| 2010-01-01 00:01:00 | 1.78 | 0.234 | 240.07 |
| 2010-01-01 00:03:00 | 1.746 | 0.186 | 240.26 |
| 2010-01-01 00:06:00 | 1.68 | 0.1 | 239.72 |
| 2010-01-01 00:07:00 | 1.688 | 0.102 | 240.34 |
| 2010-01-01 00:08:00 | 1.676 | 0.072 | 241.0 |
| 2010-01-01 00:11:00 | 1.618 | 0.0 | 240.11 |
| 2010-01-01 00:13:00 | 1.618 | 0.0 | 240.09 |
| 2010-01-01 00:14:00 | 1.622 | 0.0 | 240.38 |
| 2010-01-01 00:15:00 | 1.622 | 0.0 | 240.4 |
+---------------------+---------------------+-----------------------+---------+
[229027 rows x 4 columns]
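Because the index is sorted, both the range and a slice can be found without scanning every row. The following plain-Python sketch uses binary search for this. The helper names are hypothetical, and the half-open interval [start, end) is an assumption of the sketch; the library's handling of the end points may differ.

```python
import bisect
import datetime as dt

def time_range(timestamps):
    """Return (first, last) for a sorted, non-empty list of time stamps."""
    return timestamps[0], timestamps[-1]

def slice_bounds(timestamps, start, end):
    """Return (lo, hi) such that timestamps[lo:hi] lies in [start, end)."""
    lo = bisect.bisect_left(timestamps, start)
    hi = bisect.bisect_left(timestamps, end)
    return lo, hi

stamps = [dt.datetime(2006, 12, 16, 17, m) for m in (24, 26, 28, 29, 31)]
first, last = time_range(stamps)
lo, hi = slice_bounds(
    stamps,
    dt.datetime(2006, 12, 16, 17, 26),
    dt.datetime(2006, 12, 16, 17, 29),
)
selected = stamps[lo:hi]
```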
Time series grouping
Quite often in time series analysis, we need to split a single large time series into groups of smaller time series based on a property of the time stamp (e.g. the day of the week).
The output of the group operator is a graphlab.timeseries.GroupedTimeSeries object, which can be used for retrieving one or more groups, or iterating through all groups. Each group is a separate time series which possesses the same columns as the original time series.
In this example, we group the time series household_ts by the day of the week.
household_ts_groups = household_ts.group(gl.TimeSeries.date_part.WEEKDAY)
print household_ts_groups.groups()
Rows: 7
[0, 1, 2, 3, 4, 5, 6]
household_ts_groups is a GroupedTimeSeries containing 7 groups, where each group is a single TimeSeries. In this example, the groups are named 0 through 6, where 0 is Monday. We can access the data corresponding to Mondays as follows:
household_ts_monday = household_ts_groups.get_group(0)
+---------------------+---------------------+-----------------------+---------+
| DateTime | Global_active_power | Global_reactive_power | Voltage |
+---------------------+---------------------+-----------------------+---------+
| 2006-12-18 00:00:00 | 0.278 | 0.126 | 246.17 |
| 2006-12-18 00:03:00 | 0.206 | 0.0 | 245.94 |
| 2006-12-18 00:04:00 | 0.206 | 0.0 | 245.98 |
| 2006-12-18 00:06:00 | 0.204 | 0.0 | 245.22 |
| 2006-12-18 00:07:00 | 0.204 | 0.0 | 244.14 |
| 2006-12-18 00:08:00 | 0.212 | 0.0 | 244.0 |
| 2006-12-18 00:09:00 | 0.316 | 0.134 | 244.62 |
| 2006-12-18 00:10:00 | 0.308 | 0.132 | 244.61 |
| 2006-12-18 00:11:00 | 0.306 | 0.134 | 244.97 |
| 2006-12-18 00:12:00 | 0.306 | 0.136 | 245.51 |
+---------------------+---------------------+-----------------------+---------+
[146934 rows x 4 columns]
We can also iterate over all the groups in this GroupedTimeSeries object:
for name, group in household_ts_groups:
    print name, group
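Grouping by a property of the time stamp amounts to computing a key per row and collecting rows under that key. A minimal plain-Python sketch for the day-of-week case follows; group_by_weekday is a hypothetical helper, and it relies on Python's datetime.weekday(), which also numbers Monday as 0.

```python
import datetime as dt
from collections import defaultdict

def group_by_weekday(series):
    """Split (timestamp, value) pairs into groups keyed by weekday (0 = Monday)."""
    groups = defaultdict(list)
    for ts, value in series:
        groups[ts.weekday()].append((ts, value))
    return dict(groups)

series = [
    (dt.datetime(2006, 12, 16, 17, 24), 4.216),  # a Saturday
    (dt.datetime(2006, 12, 18, 0, 0), 0.278),    # a Monday
    (dt.datetime(2006, 12, 18, 0, 3), 0.206),    # a Monday
]
groups = group_by_weekday(series)
monday = groups[0]
```

Since the input is already sorted by time, each group stays sorted as well, so every group is itself a valid time series.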
Time series union
We can also merge multiple time series into a single one using the union operator. The merged time series is a valid time series with the time stamps sorted correctly. In this example, we will use the union operator to re-unite the time series that we split by the day of the week (using the group operator).
household_ts_combined = household_ts_groups.get_group(0)
for i in range(1, 7):
    group = household_ts_groups.get_group(i)
    household_ts_combined = household_ts_combined.union(group)
+---------------------+---------------------+-----------------------+---------+
| DateTime | Global_active_power | Global_reactive_power | Voltage |
+---------------------+---------------------+-----------------------+---------+
| 2006-12-16 17:24:00 | 4.216 | 0.418 | 234.84 |
| 2006-12-16 17:26:00 | 5.374 | 0.498 | 233.29 |
| 2006-12-16 17:28:00 | 3.666 | 0.528 | 235.68 |
| 2006-12-16 17:29:00 | 3.52 | 0.522 | 235.02 |
| 2006-12-16 17:31:00 | 3.7 | 0.52 | 235.22 |
| 2006-12-16 17:32:00 | 3.668 | 0.51 | 233.99 |
| 2006-12-16 17:40:00 | 3.27 | 0.152 | 236.73 |
| 2006-12-16 17:43:00 | 3.728 | 0.0 | 235.84 |
| 2006-12-16 17:44:00 | 5.894 | 0.0 | 232.69 |
| 2006-12-16 17:46:00 | 7.026 | 0.0 | 232.21 |
+---------------------+---------------------+-----------------------+---------+
[1025260 rows x 4 columns]
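When every input is already sorted by time, a union can be computed as a merge of sorted sequences. The plain-Python sketch below illustrates the idea with heapq.merge (its key argument requires Python 3.5 or later). The function merge_series is hypothetical and is not how the union operator is implemented; the last Saturday value is made up for illustration.

```python
import datetime as dt
import heapq

def merge_series(*series):
    """Merge time-sorted lists of (timestamp, value) pairs into one sorted list."""
    return list(heapq.merge(*series, key=lambda pair: pair[0]))

saturday = [
    (dt.datetime(2006, 12, 16, 17, 24), 4.216),
    (dt.datetime(2006, 12, 16, 17, 26), 5.374),
    (dt.datetime(2006, 12, 23, 0, 0), 1.0),  # illustrative value
]
monday = [
    (dt.datetime(2006, 12, 18, 0, 0), 0.278),
    (dt.datetime(2006, 12, 18, 0, 3), 0.206),
]
combined = merge_series(monday, saturday)
combined_stamps = [ts for ts, _ in combined]
```

The merge interleaves the Monday rows between the two Saturdays, so the combined result is again ordered by time stamp.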
Common operations with SFrame/SArray
Because the time series data structure is backed by an SFrame, many operations behave exactly as they do on an SFrame. These include:
- Logical filters (row selection)
- SArray apply functions (univariate user-defined functions, or UDFs)
- Time series apply functions (multivariate UDFs)
- Selecting columns
- Adding, removing, and swapping columns
- Head, tail, row range selection
- Joins (on the non-index column)
See the chapter on SFrame for more usage details on the above functions.
Save and Load
Just like every other object, the time series can be saved and loaded as follows:
household_ts.save("/tmp/first_copy")
household_ts_copy = gl.TimeSeries("/tmp/first_copy")