The FP-Tree is a tree data structure that stores transaction data in a compact form and is specialized for frequent pattern search. It reduces data volume by excluding items that do not fulfill the minimum support condition and by aggregating transactions that contain the same sets of items. The FP-Tree construction procedure is explained using the transactions shown in Table I, with the minimum support set at 20%.

TABLE I
Transaction ID   Item IDs
1                b, c
2                b, d
3                a, b, e
4                a, b
5                a, b, c, e
6                a, c, e
7                b, c
8                a, c, e
9                a, b, c
10               f
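The two reductions named above (dropping items below the minimum support and merging transactions that contain the same items) can be checked on the Table I data with a minimal sketch. The function names `frequent_items` and `aggregate` are illustrative, and the sketch assumes an item qualifies when its count is at least 20% of the number of transactions.

```python
from collections import Counter

# Transactions from Table I (transaction ID -> set of item IDs).
transactions = {
    1: {"b", "c"}, 2: {"b", "d"}, 3: {"a", "b", "e"}, 4: {"a", "b"},
    5: {"a", "b", "c", "e"}, 6: {"a", "c", "e"}, 7: {"b", "c"},
    8: {"a", "c", "e"}, 9: {"a", "b", "c"}, 10: {"f"},
}
MIN_SUPPORT = 0.20  # 20% of the transactions, as in the text


def frequent_items(transactions, min_support):
    """Count each item and keep those that reach the minimum support."""
    counts = Counter(item for items in transactions.values() for item in items)
    # Assumption: "fulfills the minimum support" means count >= threshold.
    threshold = min_support * len(transactions)
    return {item: n for item, n in counts.items() if n >= threshold}


def aggregate(transactions, kept):
    """Drop infrequent items, then merge transactions with identical item sets."""
    merged = Counter()
    for items in transactions.values():
        reduced = frozenset(item for item in items if item in kept)
        if reduced:  # a transaction with no frequent item carries no pattern
            merged[reduced] += 1
    return merged


supports = frequent_items(transactions, MIN_SUPPORT)
merged = aggregate(transactions, supports)

# Itemlist in descending order of frequency (ties broken alphabetically).
for item, n in sorted(supports.items(), key=lambda kv: (-kv[1], kv[0])):
    print(item, n)
```

Items d and f each appear once and are dropped, and the nine remaining non-empty transactions collapse into seven distinct item sets.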
First, a list is compiled in which all items are sorted in descending order of their frequencies (Fig. 1), and the items whose frequencies satisfy the minimum support are identified. Items d and f do not fulfill the condition and do not form any frequent patterns; therefore, they are excluded from further procedures. Second, the items in each transaction are sorted in the order of the itemlist, and an FP-Tree is constructed as shown in Fig. 1. Each node has three attributes: item ID, item count, and pointer. The item ID registers the item represented by the node, the item count registers the number of transactions represented by the path reaching this node, and the pointer links to the next node in the FP-Tree with the same item ID, or is null if there is none.

Item ID   Count
b         7
a         6
c         6
e         4
d         1
f         1

FP-Tree (item:count, children indented under their parent):
root
  b:7
    c:2
    a:4
      e:1
      c:2
        e:1
  a:2
    c:2
      e:2

Figure 1. Itemlist and FP-Tree.

The FP-Tree contains complete information for mining frequent patterns from transactions. For example, transactions that include item e are recorded on the paths from the root to the nodes of item e. Four transactions have item e: one transaction with items b, a, and e, one transaction with items b, a, c, and e, and two transactions with items a, c, and e.

The FP-tree compresses its data size by aggregating information of transactions with the same items. A more compact FP-tree is obtained when more frequent items are placed closer to the root node.

B. FP-Growth Algorithm
The FP-Growth algorithm is a recursive algorithm composed of the following procedures.

1. Select an item in order from the bottom of the itemlist, and add it to the set of items of the conditional pattern-base (CPB).

2. Search the FP-Tree for transactions that include the item added to the CPB. It is only required to consider items on the path from the root to a node of the item added to the CPB, because the items added to the CPB are selected in order from the bottom of the itemlist. Form the list of items whose frequency satisfies the minimum support, and extract frequent patterns.

3. Construct the conditional FP-Tree from the transactions selected in the second procedure.

An example of the procedure is shown here. First, item e, the least frequent item that satisfies the minimum support condition, is selected and added to the CPB; since the CPB is initially the null set, it is updated to {e}. Next, the transactions that include item e are searched from the FP-Tree in Fig. 1, as shown in Table II, and the frequencies of the other items are counted, forming the itemlist in Fig. 2. As items a, b, and c satisfy the minimum support condition, itemsets {a, e}, {b, e}, and {c, e} are extracted as frequent patterns. Then, the conditional FP-Tree for the CPB {e} is constructed from Table II. Recursively applying the procedures to the conditional FP-tree in Fig. 2 enumerates all frequent patterns that include item e.

TABLE II. TRANSACTIONS INCLUDE CPB {E}.
Sets of Items   Count
b, a            1
b, a, c         1
a, c            2

Item ID   Count
a         4
c         3
b         2

Conditional FP-Tree (item:count, children indented under their parent):
root
  a:4
    b:1
    c:3
      b:1

Figure 2. Itemlist and conditional FP-Tree regarding CPB {e}.

III. EXTRACTION OF CONGESTION PATTERNS BY EXTENDING FREQUENT PATTERN MINING ALGORITHM
Suppose that a dataset of congested links at each time for multiple days is given. This study aims to extract congestion patterns that occur repeatedly on multiple days and represent the propagation processes of traffic congestion from bottlenecks to neighboring roads. It searches for sets of congested links that are spatially connected and temporally continuous.

An item is defined as a state in which congestion occurs on a link at a period of time. A transaction is defined as a set of items in a day. Then, frequent pattern mining can extract congestion patterns. However, the result includes unnecessary patterns in which congested links are not connected with each other or the times of congestion occurrence are not continuous. The extraction of unnecessary patterns reduces the feasibility of analysis. Therefore, it is essential to develop methods that extract frequent patterns effectively.

This section explains two extensions to extract congestion patterns efficiently.

A. Extraction of Congestion Patterns that are Spatially Connected and Temporally Continuous
This study proposes an extension of the FP-Growth algorithm that limits the range of pattern search to itemsets configured by connected items. As discussed previously, the original
FP-Growth algorithm searches the conditional FP-tree for an item whose count exceeds the given minimum support value. If an item is found, the union of the item and the CPB is taken as one of the frequent patterns. If the connectivity of items is tested at this stage, the extraction of unnecessary patterns can be avoided. If all items recorded in an FP-tree have no connection with the items in the CPB, there is no need to search the patterns from the tree. However, if there are items that have connectivity in an FP-tree, transactions related to the items should be extracted from the tree. As the FP-Growth algorithm extracts transaction data by tracing nodes from leaf to root in the FP-tree, it is convenient to place items that have connectivity to the items in the CPB close to the leaf nodes. The original algorithm arranges items in descending order of frequency and constructs an FP-tree by placing items from the root node. This study proposes to first arrange the items without connectivity to items in the CPB, then the items with connectivity, and to construct the FP-tree by the same procedure as the original algorithm.

This simple extension limits the search range and makes the analysis more efficient than the original FP-Growth algorithm.

B. Closed Pattern Search through Aggregation of Frequent Itemset
In urban areas, traffic congestion is a daily occurrence at the same bottleneck and time. Thus, some itemsets are observed at an extremely high ratio in the data. For instance, some combinations of connected links are always congested at the same time, and some links are congested at a regular time during the day. When analyzing congestion that continues for a long time and expands to a wide area, it is necessary to analyze patterns that consist of many items that stand for congested links at a certain time. Frequent pattern mining enumerates all subsets of frequent patterns. The number of subsets of a pattern increases exponentially with the number of items in the pattern; therefore, it spoils the feasibility of analysis. This study aggregates itemsets that appear frequently and simultaneously in preprocessing to solve this issue. This is expected to reduce the enumeration of subset patterns and the calculation time. This study adopts confidence for the evaluation of itemsets to be aggregated. The minimum confidence value is calculated by taking each single item of the itemset individually as the condition part, and the itemset is aggregated if the minimum value exceeds the user-specified threshold.

However, the results of frequent pattern extraction are influenced by the aggregation. Consider the transactions in Table III, and search for frequent patterns when the minimum support is set at 0.6. Without the aggregation, all frequent patterns, which include subsets, are enumerated: {a}, {b}, {c}, {d}, {a, b}, {a, c}, {a, d}, {b, c}, {b, d}, {a, b, c}, and {a, b, d}. There are two closed patterns, {a, b, c} and {a, b, d}, and the others are subsets of the two.

TABLE III. SAMPLE TRANSACTION DATA
Transaction ID   Item IDs
1                a, b, c
2                a, b, c
3                a, b, c, d
4                a, b, c, d
5                a, b, d

Suppose that the aggregation threshold is set to 100%; this means that only itemsets in which all items always co-occur are aggregated. In this example, items a and b always appear in transactions and are therefore aggregated, with ab denoting the aggregated set of items a and b. After this aggregation, the number of frequent patterns enumerated is reduced to 3: {ab}, {ab, c}, and {ab, d}. Most patterns that are subsets of other patterns are not extracted, and all closed patterns are extracted. When the threshold is set to 100%, it does not affect the extraction of closed patterns. However, when the aggregation threshold is set at 80%, items a, b, and c are aggregated into one and only {abc} is enumerated. When the threshold is set to a value other than 100%, the closed patterns that partially overlap with aggregated itemsets cannot be enumerated, although the aggregation reduces input data size, accelerates calculation, and improves the feasibility of applying the frequent pattern algorithm to a large dataset.

As discussed previously, a trade-off exists between the correctness of pattern mining results and the feasibility of the mining algorithm. It is advisable to set the aggregation threshold as high as possible as long as the pattern mining algorithm works.

IV. APPLICATION

A. Summary of Traffic Sensor Data and Congested Link Information
This study analyzes the observation data of traffic sensors located in three Secondary Area Partitions of Standard Grid Square in the southern part of the main island of Okinawa: Naha: #392725, Itoman: #392715, and Yonabaru: #392726. The data consist of time-mean speeds and traffic volumes at five-minute intervals on 510 links over 853 days from June 1, 2011 to September 30, 2013.

This study first composes traffic congestion information of each link at each time, based on an estimated critical speed of each link. The critical speed is estimated separately for weekdays and holidays, and for hours without rainfall and with heavy rainfall, to analyze the differences in traffic congestion patterns caused by the differences in types of vehicles and drivers by day of the week, and by weather conditions. The observations at the Naha meteorological station are used for setting the weather condition. Rainfall that exceeds 5 mm/h is defined as the heavy rain condition. The traffic sensor data are observed for 20 472 h, of which 18 831 h have no rainfall and 378 h have heavy rainfall.

Fig. 3 shows a sample of fundamental diagrams at the Asahibashi intersection on the southbound of Route 58 without rainfall on weekdays. Traffic volume and time-mean speed are recorded as integer values in the traffic sensor data. The recorded pairs of volume and speed whose frequencies are less than 0.01% of the total number of records with at least one vehicle observation are excluded as outliers. Fig. 3 shows the speed-density and speed-flow relations. Traffic density data are rounded to every 0.5 vehicles per kilometer. Red points indicate data that satisfy the frequency threshold, and points with colder colors indicate data with smaller frequencies than the threshold. Space-mean speed is required to estimate traffic density, but time-mean speed has to be adopted instead of space-mean speed in the analysis due to the data limitation. Thus, average speed is overestimated and traffic density is underestimated.
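The direction of this bias can be illustrated with a small sketch. It is not the authors' processing code: it assumes the standard relation density = flow / speed, takes the space-mean speed as the harmonic mean of spot speeds (a textbook result for stationary traffic), and uses hypothetical spot speeds. The function name `density_veh_per_km` is illustrative.

```python
from statistics import fmean, harmonic_mean


def density_veh_per_km(count_5min, mean_speed_kmh, step=0.5):
    """Estimate density k = q / v and round it to the nearest `step` veh/km."""
    flow_veh_per_h = count_5min * 12  # twelve 5-minute periods per hour
    density = flow_veh_per_h / mean_speed_kmh
    return round(density / step) * step


# Hypothetical spot speeds (km/h) of 40 vehicles passing a sensor in 5 minutes.
spot_speeds = [20.0] * 20 + [60.0] * 20

time_mean = fmean(spot_speeds)           # arithmetic mean, as recorded by the sensor
space_mean = harmonic_mean(spot_speeds)  # assumed estimate of space-mean speed

k_time = density_veh_per_km(len(spot_speeds), time_mean)
k_space = density_veh_per_km(len(spot_speeds), space_mean)

print(f"time-mean speed  {time_mean:.1f} km/h -> density {k_time:.1f} veh/km")
print(f"space-mean speed {space_mean:.1f} km/h -> density {k_space:.1f} veh/km")
```

Because the arithmetic mean is never smaller than the harmonic mean, a density computed from time-mean speed is never larger than one computed from space-mean speed.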
Next, the critical speed on each link is estimated from traffic density and average speed data. Fig. 3(a) shows that the speed values that correspond to the same density value have large variation. To estimate the maximum flow performance of road links, this study first removes data points in extreme conditions that are far from the traffic condition near critical speed; that is, points with traffic density smaller than 5 vehicles per kilometer or average speed lower than 20 km/h. Then, the data point with the highest speed value is selected for each density value. Using the selected data points, the linear relationship between speed and density is estimated. The black line in Fig. 3(a) illustrates the estimated speed-density relationship. Under the assumption that the relationship between density and speed is linear, critical speed corresponds to half of the intercept in the speed-density relationship. A red bold line in Fig. 3(b) illustrates the estimated critical speed; it seems to correspond

(a) Speed-density (b) Speed-flow
Figure 3. Fundamental diagrams at Asahibashi intersection on the southbound of Route 58 without rain on weekdays. (Link ID is 6 in the Standard Grid Square Code 392725.)
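The estimation steps described above can be sketched as follows. With a linear relation v = v_f - b k, the flow q = k v = k (v_f - b k) is maximal at k = v_f / (2b), where v = v_f / 2, which is why the critical speed is half of the intercept. The text does not state how the line is fitted; ordinary least squares is assumed here, and the function name `critical_speed` and the sample points are hypothetical.

```python
def critical_speed(points, min_density=5.0, min_speed=20.0):
    """Return (critical speed, intercept, slope) from (density, speed) points.

    Density is in veh/km and speed in km/h.
    """
    # 1. Remove extreme conditions: density below 5 veh/km or speed below 20 km/h.
    kept = [(k, v) for k, v in points if k >= min_density and v >= min_speed]

    # 2. Keep the highest speed observed for each density value.
    top = {}
    for k, v in kept:
        top[k] = max(v, top.get(k, v))
    if len(top) < 2:
        raise ValueError("need at least two density values to fit a line")

    # 3. Fit v = intercept + slope * k (assumption: ordinary least squares).
    ks = list(top)
    vs = [top[k] for k in ks]
    n = len(ks)
    mean_k = sum(ks) / n
    mean_v = sum(vs) / n
    sxx = sum((k - mean_k) ** 2 for k in ks)
    sxy = sum((k - mean_k) * (v - mean_v) for k, v in zip(ks, vs))
    slope = sxy / sxx
    intercept = mean_v - slope * mean_k

    # 4. Under the linear assumption, critical speed is half of the intercept.
    return intercept / 2, intercept, slope


# Hypothetical points: the upper envelope lies on v = 60 - 0.5 k.
sample = [
    (2.0, 70.0),                # removed: density below 5 veh/km
    (5.0, 57.5), (5.0, 41.0),
    (10.0, 55.0), (10.0, 30.0),
    (20.0, 50.0), (20.0, 25.0),
    (40.0, 40.0),
    (90.0, 15.0),               # removed: speed below 20 km/h
]
v_c, v_free, slope = critical_speed(sample)
print(f"intercept {v_free:.1f} km/h, critical speed {v_c:.1f} km/h")
```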
5-min basis. However, the results of congestion judgment of each link are unstable in time. It is possible that the influence

[Figure: plot with axis "execution time (sec)"]
TABLE IV. EXECUTION TIME AND NUMBER OF CLOSED PATTERNS
                                Execution time (sec)   Number of closed patterns
Without rainfall   Fridays      3390.4                 9 450
Without rainfall   Mondays      61.1                   3 140
With rainfall      Weekdays     3391.1                 15 067

TABLE V. SELECTED CONGESTION PATTERNS IN THE CENTRAL NAHA
            Rainfall   Day of the week   All links   Max links
Pattern 1   False      Fridays           72          9
Pattern 1   False      Mondays           45          9
Pattern 1   True       Weekdays          67          10
Pattern 2   False      Fridays           68          9
Pattern 2   False      Mondays           30          2
Pattern 2   True       Weekdays          75          9
Pattern 3   False      Fridays           52          6
Pattern 3   False      Mondays           34          4
Pattern 3   True       Weekdays          42          6

rainfall are compared, the characteristics of congestion patterns are illuminated. The patterns on Fridays expand wider and continue for a longer time than the patterns on Mondays. This confirms that severe traffic conditions occur on Fridays. The patterns on Fridays without rainfall and on weekdays with rainfall have some resemblance; the congestion affects the traffic in a broader area.

When it is raining, congestion seems to occur at an earlier time. The number of congested links reaches its maximum at around 6 p.m., although the heaviest congestion usually occurs at around 7 p.m. when there is no rainfall. The difference is thought to be caused by changes in traffic demand patterns and deterioration of road performance.

The application of the proposed method confirmed the following findings. First, it succeeded in discovering congestion patterns from traffic sensor data covering more than two years, which is a much longer period than in the applications of previous studies. It is useful in finding daily traffic congestion patterns. Second, the application result confirmed that the proposed method was able to enumerate congestion patterns that illuminate a variety of traffic congestion patterns induced by different traffic conditions. The heavier congestion occurring on Fridays might be perceivable for daily drivers without the analysis of the proposed method. However, it is difficult to grasp the spatial extent over which congestion expands and the period of time for which congestion continues on Fridays, and to recognize the differences in congestion patterns by traffic conditions. These findings denote the advantages of the proposed analysis.

V. CONCLUSION
This study proposed the enumeration of traffic congestion patterns from traffic sensor data. It modified the frequent pattern mining algorithm to enumerate spatio-temporally connected congestion patterns effectively. The application of the proposed method to traffic sensor data in the southern part of the main island of Okinawa, Japan confirmed its feasibility for extracting traffic congestion patterns. It is confirmed to be useful in analyzing the differences in traffic congestion patterns by weather, time, and day of the week.

We propose the three points below for future discussion.

The proposed method generates many congestion patterns. The application in this study extracted the patterns that continue for the longest time and expand to a wider area. Analysts can also extract patterns that expand to the widest area although their durations are relatively short, which raises the possibility that a different conclusion will be reached. It is also necessary to consider which congestion patterns should be selected to analyze traffic congestion propagation processes.

Further consideration of the aggregation of co-occurring itemsets in the preprocessing is also necessary. Reducing the aggregation threshold improves the feasibility of the proposed method. However, the closed patterns that partially overlap with the aggregated itemsets cannot be enumerated. Thus, an effective mining method that generates patterns at low minimum support values without reducing the threshold for aggregation needs to be developed.

The proposed method needs further investigation to confirm its applicability to larger datasets and its effectiveness in comparative analyses of the characteristics of traffic congestion patterns in different cities.

ACKNOWLEDGMENT
The authors would like to thank the Okinawa Prefectural Police Department for providing the traffic sensor data. This work was supported by JSPS KAKENHI Grant Numbers JP26630233 and JP15H04053.

REFERENCES
[1] G. F. Newell, "A moving bottleneck," Transport. Res B: Meth., vol. 32, no. 8, pp. 531-537, 1998.
[2] D. Ni and J. D. Leonard, "A simplified kinematic wave model at a merge bottleneck," App. Math. Model., vol. 29, no. 11, pp. 1054-1072, 2005.
[3] J. Long, Z. Gao, H. Ren, and A. Lian, "Urban traffic congestion propagation and bottleneck identification," Sci. China Ser. F, vol. 51, no. 7, pp. 948-964, 2008.
[4] W.-H. Lee, S.-S. Tseng, J.-L. Shieh, and H.-H. Chen, "Discovering traffic bottlenecks in an urban network by spatiotemporal data mining on location-based services," IEEE Trans. Intell. Transp. Syst., vol. 12, no. 4, pp. 1047-1056, 2011.
[5] M. Miller and C. Gupta, "Mining traffic incidents to forecast impact," in Proc. ACM SIGKDD Int. Workshop on Urban Computing, Beijing, China, August 12-16, 2012, pp. 33-40.
[6] W. Liu, Y. Zheng, S. Chawla, J. Yuan, and X. Xing, "Discovering spatio-temporal causal interactions in traffic data streams," Proc. 17th ACM SIGKDD Int. Conf. on Knowl. Discov. and Data Min., San Diego, CA, August 21-24, 2011, pp. 1010-1018.
[7] L. X. Pang, S. Chawla, W. Liu, and Y. Zheng, "On detection of emerging anomalous traffic patterns using gps data," Data Knowl. Eng., vol. 87, pp. 357-373, 2013.
[8] V. W. Chu, R. K. Wong, W. Liu, and F. Chen, "Causal structure discovery for spatio-temporal data," in Proc. 19th Int. Conf. Database Syst. for Advanced Appl., Bali, Indonesia, 2014, pp. 236-250.
[9] H. Nguyen, W. Liu, and F. Chen, "Discovering congestion propagation patterns in spatio-temporal traffic data," in Proc. 4th Int. Workshop on Urban Computing, Sydney, Australia, August 10, 2015.
[10] H. Nguyen, W. Liu, and F. Chen, "Discovering congestion propagation patterns in spatio-temporal traffic data," IEEE Trans. Big Data, accepted, 2016.
[11] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules in large databases," in Proc. 20th Int. Conf. Very Large Databases, Santiago de Chile, Chile, September 12-15, 1994, pp. 487-499.
[12] J. Han, J. Pei, Y. Yin, and R. Mao, "Mining frequent patterns without candidate generation: a frequent-pattern tree approach," Data Min. Knowl. Discov., vol. 8, pp. 53-87, 2004.
[Figure panels: On Fridays without rainfall; On Mondays without rainfall; On Weekdays with rainfall. (a) 4:00 p.m. to 4:15 p.m.]