
Video Consolidation: Coverage and Bandwidth

Aware Video Transport in Smart Camera Networks


Abstract: Smart camera networks are increasingly used in
applications such as surveillance, border control and traffic monitoring. The cameras are often wireless to reduce the overhead of
deployment. However, transmitting videos from multiple cameras
using a wireless network can overwhelm the limited available
bandwidth. In this paper, we propose video consolidation (VC), a
new model for video delivery that combines videos from multiple
cameras, removing redundancy. We investigate two mechanisms
for consolidating videos: an in-network aggregation scheme that
merges videos at intermediate routers, and a coordination scheme
that eliminates redundant videos at the source cameras. VC also
carries out network and coverage aware bandwidth allocation
to optimize the video quality cooperatively among the different
video streams to match the available bandwidth. We propose
and investigate different policies for bandwidth allocation. We
evaluate VC both in a small camera testbed as well as using simulation. We develop two simulation scenarios based on realistic
camera deployments and publicly available video. Experiments
show that VC can deliver around 10 dB gains in video quality
in scenarios where standard networking protocols fail to deliver
acceptable video quality.

I. INTRODUCTION
Multi-camera networks are commonly used in applications
such as traffic-monitoring and security surveillance. Typically,
these deployments rely on human operators to control cameras
and monitor collected video feeds; they are limited in scale,
cost, and operator attention span [1]. Recent developments
in network cameras and wireless technology have enabled
the deployments of Smart Camera Networks (SCNs). SCNs
collaboratively self-configure, often adapting to what they are
monitoring to improve the network's overall effectiveness, and
to lower reliance on human operators. As smart cameras are
becoming more affordable, SCNs are becoming more popular
and finding a range of new applications [2]. SCNs often are
deployed with limited planning using wireless networking and
evolve organically over years. The Chicago video surveillance
system, started around 2002, grew from zero to 8000 cameras
over its first five years, and now has an estimated 20,000
cameras [3].
Video delivery in large scale SCNs over a wireless network
is a challenging problem [4], [5]. Since video consumes
significant bandwidth, it is easy to overwhelm the limited
bandwidth of the network leading to poor-quality delivered
video. Parts of the network can become congested and drop critical
video frames depending on the distribution of the cameras and
relays, as well as the concentration of interesting events in
the network. Therefore, it is essential to manage the data such
that coverage is maintained with an acceptable delivered video
quality.

To meet the demands of SCNs, we propose Video Consolidation (VC): a coverage and bandwidth aware video delivery
model for SCNs. We consider a situation where multiple
cameras are streaming videos, through one or more levels of
routers/base stations, towards an Observation Center (OC). VC
consists of the following mechanisms:

Video combination and redundancy removal: The
video feeds from individual cameras are often combined
at the OC to present a single view to the observers
for ease of monitoring. The individual feeds often have
overlapping regions that are redundantly carried as part
of different video streams within the network towards the OC, only to be eliminated when the videos
are combined. This component of VC combines the
videos in the network to eliminate the redundancy. We
also investigate detecting the overlap and informing the
cameras to eliminate the redundancy at the source. Eliminating redundancy from data is similar to aggregation in
conventional sensor networks [6]-[9]. However, detecting
the overlap and merging videos from different viewpoints is significantly more complicated. We discuss this
component of the system in Section II, and overview
design and implementation issues in Section III. To our
knowledge, this is the first time in-network aggregation
is applied to video data.
Coverage aware bandwidth allocation: Having reduced
the size of the video by eliminating redundancy, the
second component of VC matches the data rates of the
combined videos to the available network bandwidth.
The available bandwidth at any point in the network is
shared with nearby cameras and routers; the decision on
bandwidth allocation must be coordinated. At the same
time, the value of the video streams being transmitted
may be different (as measured by the coverage area, or
other application specific metrics). We propose a number
of bandwidth allocation policies based on coverage-level
fairness (Section IV). While there is significant work in
wireless multimedia networking on adapting video rate
to available bandwidth, the vast majority of it targets a
single video feed/camera [10]-[12]. To our knowledge,
VC is the first proposal to adapt the bandwidth in a
coordinated way across multiple video streams while
taking into account both network and video application
characteristics.

We evaluate the proposed ideas both in a small multi-camera
testbed as well as using simulation. For simulation, we model
two SCNs based on recent deployments and using
publicly available video feeds. One scenario is modeled after
a highway monitoring SCN in Germany, while the other
models a border-monitoring deployment in Arizona on the
US-Mexico border. Since video processing operations such as
the ones proposed here are difficult to evaluate accurately in
network simulators that statistically model video generation,
we modify the simulator to operate on pre-collected videos
that are played back and processed during simulation. This
methodology allows us to accurately evaluate the impact of
networking and video events on the overall delivered video
quality. Our results (Section V) show that VC continues to
deliver high quality video with Peak Signal-to-Noise Ratio
(PSNR) video quality [13] of 33 dB when the standard
protocols deliver unviewable video quality (PSNR=22 dB) due
to a large number of video frame errors.
In summary, the contributions of the paper are:
1) We propose video consolidation in SCNs to remove redundant overlap regions from videos being transported.
We investigated consolidation in the network, as well as
at the end cameras. To our knowledge, this is the first
proposal for aggregation in the context of video.
2) We propose a coordinated coverage and network aware
bandwidth allocation and video rate control in SCNs.
To our knowledge, this is the first video delivery model
targeted towards multimedia networking of multiple
related video streams.
3) We develop an accurate simulation model for video delivery in SCNs which is capable of playing back videos
across the network and performing in-network processing
of video frames.
4) We investigate VC both in a camera testbed as well
as using simulation, showing that it can dramatically
improve performance.
We discuss limitations and future extensions of the system
in Section VI. Section VII overviews related work in wireless
multimedia networking and sensor networks. We present our
future plans for expanding this system and conclude
in Section VIII.
II. VIDEO PRESENTATION AND CONSOLIDATION MODELS
We propose a coverage and bandwidth-aware video delivery
system, which we call Video Consolidation (VC). The first
component of VC (described in this section and the next)
efficiently constructs and delivers a consolidated video of the
region being monitored, while removing redundancy, to an
Observation Center (OC). The second component of the system (described in Section IV allocates the available bandwidth
to the routers in the system while taking into account both
network capacity and the importance of the different video
feeds. The routers then change their encoding rate to match
the bandwidth allocation.
A. Video Presentation Models
The presentation model characterizes how multiple video
streams are aggregated to produce a unified stream, which
is presented at the OC.

Fig. 1. Overview of Consolidation by Aggregation (CA)

The presentation model has significant implications on how
videos can be combined and the
relative value of the different videos. In this paper, we use a
mosaicing based video presentation, which stitches multiple
videos from nearby regions to form a single mosaic of the
covered area [14]. While mosaicing is the most common
video presentation approach, other approaches such as video
summarization [15] and 3D holograms [16] are possible, and
would invite different in-network video combining functions.
B. Video Consolidation Approaches
In this section, we overview the proposed approaches for
video consolidation. We describe Consolidation by Aggregation (CA), which consolidates videos at intermediate routers
to remove redundancy, and Consolidation by Coordination
(CC) which pushes the intelligence to the cameras to remove
redundancy at the source.
1. Consolidation by Aggregation (CA): CA performs in-network processing at the routers to remove redundancy in
videos before transmitting towards the OC. Figure 1 summarizes CA. Cameras (C1-C4) view different parts of the
surveillance area with possibly overlapping Fields-of-View
(FoVs). The video streams from the cameras are forwarded
towards the Observation Center (OC) by video routers.
The routers decode the video from multiple video streams,
and attempt to consolidate them into a single video stream. The
output video stream contains the stitched mosaic video from
all the input video streams. Stitching eliminates the redundant
parts in the different camera FoVs. In the above scenario, the
video streams from different cameras are decoded and then
stitched into a single mosaic at Level-1 routers (R11, R12). If
an FoV of a camera is completely redundant, such as C4's FoV,
then its video can be eliminated. Multi-level aggregation is
enabled by aggregating at multiple points as the video stream
travels towards the OC.
2. Consolidation by Coordination (CC): CC applies in SCNs
that use computationally capable cameras. In this case, the
network coordinates to provide the cameras with information
about redundant regions, allowing them to prune these regions
without transmission. This approach eliminates the in-network
processing at the routers.

Fig. 2. Creating a Mosaic Image

Fig. 3. Components of Video Fusion phase
III. CONSOLIDATION DESIGN AND IMPLEMENTATION
We first describe our generic video mosaicing technique to
combine videos, and then explain the design and implementation of video combination components in CA and CC.
A. Video Mosaicing: Background
Video Mosaicing is performed by repeatedly stitching
frames from the cameras which cover neighboring Fields-of-View (FoVs). As illustrated in Figure 2, the video stitching
has the following steps: (1) decode each video into component
images, (2) detect the common points, called feature points,
between two images, (3) calculate the transformation matrix
that assists in computing how the original image is projected
in the final mosaic, (4) compute masks that indicate which
parts of the images are retained in the final mosaic, (5) merge
the images into one mosaic image, and finally (6) if needed,
encode the mosaic image and transmit to the next hop. We
extend a well-known image stitching tool, PanoTools, to stitch
videos [14].
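The authors use PanoTools for these steps; purely as an illustration of the same pipeline (not the system's actual implementation), the sketch below runs through the steps with OpenCV and NumPy for a two-camera case. All function and variable names here are our own assumptions.

```python
import cv2
import numpy as np

def mosaic_params(img_a, img_b):
    """Steps 2-4 (run once per FoV configuration): feature points,
    transformation matrix, and a mask of where img_b lands in the mosaic."""
    orb = cv2.ORB_create(2000)
    kp_a, des_a = orb.detectAndCompute(img_a, None)
    kp_b, des_b = orb.detectAndCompute(img_b, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des_a, des_b), key=lambda m: m.distance)[:200]
    src = np.float32([kp_b[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp_a[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)      # projects img_b into img_a's frame
    size = (img_a.shape[1] + img_b.shape[1], img_a.shape[0])  # generous mosaic canvas
    mask_b = cv2.warpPerspective(np.full(img_b.shape[:2], 255, np.uint8), H, size)
    return H, size, mask_b                                    # parameters reused for later frames

def merge_frames(img_a, img_b, H, size):
    """Steps 5-6 (run per frame): warp and merge; encoding is left to the caller."""
    mosaic = cv2.warpPerspective(img_b, H, size)              # project img_b onto the canvas
    mosaic[:img_a.shape[0], :img_a.shape[1]] = img_a          # keep img_a's pixels where present
    return mosaic
```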
B. Video Combination in CA
A naïve approach for video consolidation is to repeat the
video mosaicing process (Figure 2) at every intermediate
router for every video frame that is received at the router.
However, such a system can be computationally expensive for
video delivery in large distributed SCNs.
We propose a three-phase system that efficiently stitches the
video at the intermediate routers. The motivation is to reduce
the overhead of the computationally intensive steps in mosaicing:
feature point detection, transformation-matrix computation and mask
generation. Conveniently, these expensive
components are not required to be executed at every frame
if the camera's FoV parameters do not change. Hence, we
preprocess these steps once in a Mosaicing parameter generation phase, and use the resulting parameters repeatedly to
stitch images from every frame. The above approach provides
a speedup of up to 15x in our system. We now describe the
three phases in detail.
1. Mosaicing parameter generation:
This phase is executed at a router for the first frame received
from the camera or another downstream router. Using the
images from the frame, the system computes the mosaicing
parameters. This phase can be locally performed at a router,
but for simplicity and due to the limitations of the mosaicing
tool (PanoTools) we currently perform this operation centrally.
Each Level-1 router (Figure 1) analyzes the input video frames,
and computes the mosaicing parameters for combining its
input videos. The Level-2 router analyzes the consolidated
streams that arrive as its input from the Level-1 routers, and computes
the second-level mosaicing parameters. This process is repeated
until the masks are created to generate a unified video.
2. Connection Re-establishment:
Once the aggregation points receive the transformation matrices and masks, they break the connection from the cameras to
the OC, and instruct the cameras to open a new connection to
the aggregation point. The original camera video is sent over
this connection. The router then opens another connection to
the next level aggregation point.
3. Video Fusion:
This phase is executed at every packet reception. Video fusion
is performed by decoding the video frames, merging the
frames into a mosaic, and, finally, encoding and transmitting
the mosaic frame to the up-link router (as shown in Figure 3).
The input video streams to the routers consist of encoded
frame data. The data for one frame can be of arbitrary size,
and hence can be made up of multiple packets or fragments.
To track fragments of the frame, we transmit the video ID,
sequence number, frame number, frame size and fragment
number in the header of each video data packet. The cameras
are time-synchronized with a tolerance matching the dynamics
of the video (typically within milliseconds). This is a typical
requirement on multi-camera systems when the streams are
being monitored together.
A Frame Buffer stores the packets until they are decoded
into a single frame. Once the complete frame is received,
it is decoded into an image, and stored in an image buffer
(currently, in the YUV format). The router waits for frames
from all its input streams to be available before merging. Once
the frames are available, the router merges the images, encodes
the image and transmits to the up-link router.
We handle complete and partial frame failures by observing
the sequence numbers of the frames. If an aggregation point
receives fragments of frame n but not all fragments of frame
n - 1, then it concludes that the previous frame was not fully
received. For all the correctly received frames, it creates an
image using the YUV information stored in its buffer. For the
partially received frames, it decodes the video frame by packing
null values for the remaining bytes.
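As a sketch of this reassembly logic (the header fields follow the list above, but the byte layout, fragment size, and class names are our own assumptions, not the paper's wire format):

```python
import struct
from collections import defaultdict

# Assumed header: video ID, sequence number, frame number, frame size, fragment number
HDR = struct.Struct("!HIIIH")
FRAG_PAYLOAD = 1400                      # assumed fragment payload size in bytes

class FrameBuffer:
    """Collects fragments per (video, frame) until the frame can be decoded."""
    def __init__(self):
        self.frags = defaultdict(dict)   # (video_id, frame_no) -> {frag_no: payload}
        self.sizes = {}                  # (video_id, frame_no) -> declared frame size

    def add_packet(self, packet):
        vid, _seq, frame_no, frame_size, frag_no = HDR.unpack(packet[:HDR.size])
        key = (vid, frame_no)
        self.frags[key][frag_no] = packet[HDR.size:]
        self.sizes[key] = frame_size
        return key

    def assemble(self, key):
        """Concatenate fragments in order; missing fragments are padded with null
        bytes so the decoder can still produce an (error-concealed) image."""
        frame_size = self.sizes[key]
        n_frags = -(-frame_size // FRAG_PAYLOAD)             # ceiling division
        data = bytearray()
        for i in range(n_frags):
            data += self.frags[key].get(i, b"\x00" * FRAG_PAYLOAD)
        return bytes(data[:frame_size])
```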


Once all the frames from the incoming videos are received
at the aggregation router, we create a new frame for the output
video by: (1) transforming the image using pre-calculated mosaicing parameters, and (2) merging the transformed images.
The Encoding rate estimator sets the encoding rate of the
outgoing video based on the available bandwidth. Selecting
an appropriate rate for video transmission requires the
router to judge the available capacity and the videos that are
at the other cameras/routers, and to intelligently choose a rate.
This process ensures that the video delivery is coverage- and
network-aware, and we explain the details in Section IV.
C. Video Combination in CC
While a distributed implementation is possible for video
combination in CC, we use the OC to periodically stitch
the video frames to form one mosaic video, and in the
process compute the transformation matrices and masks for
each camera. This information is delivered to the respective
cameras. Thus the redundant regions of the camera FoVs are
suppressed at the camera itself. Finally, the OC uses mosaicing
to combine the frames from all the cameras.
IV. COVERAGE AWARE COORDINATED BANDWIDTH ALLOCATION
In this section, we discuss the bandwidth allocation component. We first describe the bandwidth allocation schemes in
CA, followed by a brief description of allocation in CC.
A. Bandwidth Allocation in CA
Recall that, in CA, the encoding rate estimator (Figure 3)
is responsible for intelligently setting the encoding rate of the
outgoing video. We now propose various schemes for encoding
rate assignment that are coverage- and network-aware. Our
schemes are designed by observing that all the nearby routers
should coordinate to split the available bandwidth based on the
area that the video is covering, and the collective available
capacity among the routers.
We first explain the need for intelligent bandwidth allocation
by comparing two illustrative scenarios. Consider the scenario
in Figure 1, which has a symmetric structure; each router has
an equal number of streams coming from cameras or other
routers. Consolidation is more complicated in general scenarios with asymmetric topologies. For example, consider the
proposed wide-area border surveillance application in Arizona
as shown in Figure 4 (b). Some cameras are a few hops away
from the OC, and are therefore able to obtain larger throughput
(e.g., C4), while other farther cameras can expect significantly
lower throughput (e.g., C1 and C2). Motivated by providing
fair coverage in surveillance applications, we propose three
aggregation schemes that choose suitable aggregation points
and video rates in general asymmetric topologies.
Fig. 4. Surveillance topology at Nogales, Arizona: Figure (a) shows the
existing terrain and infrastructure. A part of the area (inside the blue-dashed
line) is considered in Figure (b), the proposed surveillance topology. We
propose real-time video streaming from cameras (C1-C4) to the control
center (CC) using WiMAX base stations (BS1-BS6).

1. Hierarchical Point Fair (HP-Fair) Aggregation: SCNs
in which cameras are connected to the OC using hierarchical
network topologies (clustered topologies, cellular or WiFi-based)
can be aggregated at their Hierarchical aggregation
Points (HP), such as cluster-heads and base-stations. The HP-Fair
algorithm assigns a single aggregation rate to each HP
irrespective of the number of video streams passing through
it. HP-Fair controls the network traffic directly to make sure
that congestion does not occur.
HP-Fair can be unfair since it assigns a single rate to all
incoming videos irrespective of the contents of the video. For
example, if one incoming video has a larger coverage region
(say, since the camera covers a larger region or the video
is being transmitted by a router that has already aggregated
multiple videos), and another has a smaller coverage region,
then HP-Fair assigns equal rate to both the videos. As a result,
the algorithm assigns relatively less bandwidth to the videos
that are transmitted over multiple hops.
2. Cam-Fair aggregation: The Cam-Fair algorithm improves on HP-Fair by explicitly adjusting the encoding bit-rate from each
camera. The OC disseminates a constant value denoting the
bit-rate per camera, and each aggregation point encodes the
outgoing mosaic video based on the number of camera inputs
that are present in its mosaic. For example, for the topology
in Figure 4 (b), if the bit-rate per camera is 1 Mbps, then BS2
aggregates its mosaic at 2 Mbps, and BS3 at 3 Mbps, and the
final mosaic is aggregated at BS4 at 4 Mbps.
While the Cam-Fair algorithm explicitly accounts for camera
utility, it does not directly account for the final area covered
at the OC. This is because Cam-Fair does not consider the
part of the FoV of the camera video that finally makes it to
the mosaic video at the OC. For example, in Figure 4 (b),
cameras C1, C2 and C3 are closer to each other, and hence
have a large overlap in their FoVs. However, camera C4 has
less overlap with the other cameras, and contributes more area
to the final mosaic. Hence, the encoding rate of C4 should be
higher.
3. Coverage Fair (Cov-Fair) aggregation: Cov-Fair aggregation assigns a bit-rate for each camera FoV that is proportional
to the contribution of the camera FoV in the final mosaic.
This is performed by first estimating the bit-rate of the video
that the OC can receive. The OC measures the available network
bandwidth, and sets a single bit-rate (B) for the final mosaic
video. If the total area covered at the OC is A, and camera
C contributes an area a_C, then the video stream is assigned a
bit-rate of

AggRate(C) = B * a_C / A.

The aggregation bit-rate at each HP is set to the sum of
AggRate(C) over all the video streams from cameras C that
pass through the HP.
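A minimal sketch of this allocation (our own illustration; the topology dictionary and the example numbers are assumptions, loosely modeled on Figure 4(b)):

```python
def cov_fair_rates(B, areas, streams_through_hp):
    """B: bit-rate the OC can receive for the final mosaic (bits/s).
    areas: a_C, the area each camera contributes to the final mosaic.
    streams_through_hp: which camera streams pass through each HP."""
    A = sum(areas.values())                                   # total area covered at the OC
    agg_rate = {c: B * a_c / A for c, a_c in areas.items()}   # AggRate(C) = B * a_C / A
    hp_rate = {hp: sum(agg_rate[c] for c in cams)             # per-HP rate: sum over its streams
               for hp, cams in streams_through_hp.items()}
    return agg_rate, hp_rate

# Illustrative numbers only: C1-C3 overlap heavily, C4 contributes more unique area.
areas = {"C1": 10.0, "C2": 8.0, "C3": 7.0, "C4": 25.0}
topo = {"BS2": {"C1", "C2"}, "BS3": {"C1", "C2", "C3"}, "BS4": {"C1", "C2", "C3", "C4"}}
cam_rates, hp_rates = cov_fair_rates(3_000_000, areas, topo)  # 3 Mbps mosaic at the OC
```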
B. Bandwidth Allocation in CC
Recall that, in CC, each camera is aware of the non-overlapping
area it covers, since the OC transmits the masks
of the FoV that the cameras have to transmit. Using this
information and the available bandwidth, the cameras encode
each video with a rate that is computed similarly to the Cov-Fair
scheme in CA.
V. PERFORMANCE EVALUATION
We evaluate the effectiveness of different algorithms proposed for consolidation and discuss our findings. We study
various types of surveillance topologies using both WiFi and
WiMAX. In addition, we evaluate realistic CA and CC using
videos collected from our camera testbed as well as public
surveillance videos.
We have implemented the video delivery protocols with in-network
mosaicing in the Qualnet simulator [17]. Multi-level
encoding and decoding is sensitive to frame and packet losses, since
packet drops at one hop have a cascading effect on mosaic
videos at later hops. Therefore, existing VBR or trace-based
simulation is not sufficient [18]. We altered the simulator to
provide simulation-time in-network encoding and decoding of
video streams using the ffmpeg library [19].
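The paper links the ffmpeg library into the simulator; as a rough stand-in for that integration (our own sketch, not the authors' code), raw frames can be re-encoded at a chosen rate by piping them through the ffmpeg command-line tool:

```python
import subprocess

def encode_frames(raw_yuv, width, height, fps, bitrate):
    """Re-encode raw YUV420p frames at a target bit-rate using the ffmpeg CLI."""
    cmd = ["ffmpeg", "-loglevel", "error",
           "-f", "rawvideo", "-pix_fmt", "yuv420p",
           "-s", f"{width}x{height}", "-r", str(fps), "-i", "-",   # raw frames on stdin
           "-c:v", "mpeg4", "-b:v", str(bitrate),                  # target encoding rate
           "-f", "mpegts", "-"]                                    # encoded stream on stdout
    return subprocess.run(cmd, input=raw_yuv, stdout=subprocess.PIPE, check=True).stdout
```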
We first evaluate a simple two-level hierarchical topology
using videos generated from our testbed cameras. We then
study a large symmetric topology. Finally, we analyze CA
and CC in asymmetric topologies where bandwidth allocation
strategies have a significant effect on the performance.
A. Two-level Hierarchical Topology
We use a small camera testbed consisting of four Axis 213
network cameras viewing different parts of a panoramic image
(Figure 5). We set the cameras to have an approximate overlap
of 20% in their FoVs. Each camera streams a high-resolution
4 Mbps MPEG encoded video. These videos are collected and
then replayed in the simulator environment. This approach aids
in validating realistic stitching from multiple cameras that are
viewing different parts of the scene, while using real videos
from a small testbed.

Fig. 5. Miniature Camera Testbed
The testbed scenario is similar to Figure 1. We use the constant-rate 54 Mbps IEEE 802.11a protocol with standard parameters.
We also set a large IP queue size of 50 kbytes to absorb the
bursty nature of the video traffic. Video frames, especially
the aggregated frames, are generally large and hence require a
larger network-layer buffer to avoid queue drops before transmission. We did not consider the effect of fading, shadowing
or interference in this study to evaluate the base operation of
the video delivery protocol.
We compare CA with standard routing, which is unaware
of coverage, nature of video traffic or network capacity. Video
quality is measured using the standard Peak Signal-to-Noise
Ratio (PSNR) of the video frames [13]: higher PSNR indicates
better video quality. Usually, a PSNR above 30 dB is considered
good for compressed MPEG videos. More advanced video
metrics are used later (in Section V-B) to provide deeper
insight into VC operation.
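For reference, PSNR over a decoded frame can be computed in a few lines (a generic sketch for 8-bit frames, not tied to the evaluation code used here):

```python
import numpy as np

def psnr(reference, decoded, peak=255.0):
    """PSNR in dB between a reference frame and its decoded counterpart (8-bit arrays)."""
    mse = np.mean((reference.astype(np.float64) - decoded.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")                      # identical frames
    return 10.0 * np.log10(peak ** 2 / mse)
```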
Figure 6(a) shows the PSNR and frame success rate (number
of frames received without any errors) of the consolidated
video. Ideally, the network would be able to support all the
video traffic, since the overall average demand for all the videos
(48 Mbps) is lower than the nominal capacity of the channel
(54 Mbps). However, the effective capacity of the channel is
significantly lower due to the MAC frame and contention
overheads. With standard routing, transmitting videos at
rates greater than 3-4 Mbps results in many frames with errors,
which drastically reduces the video quality. We see this effect
in spite of the large IP queue size.
This behavior is exacerbated by the bursty nature of the
video transmission; we noticed a peak-to-average bit-rate of
above 10. Video encoding consists of two types of video
frames: key frames and non-key frames. In general, key frames
are much larger than other frames. In this scenario with 400
kbps videos, the average key frame size is around 25,000
bytes, whereas the non-key frames are 1500 bytes. Hence,
there will be a sudden burst of traffic when cameras transmit
key frames (approx. once a second).
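A quick back-of-the-envelope calculation shows where this peak-to-average ratio comes from (the 30 fps frame rate is our assumption):

```python
fps = 30                                  # assumed frame rate
key_bits     = 25_000 * 8                 # ~25,000-byte key frame, roughly once per second
non_key_bits = 1_500 * 8                  # ~1,500-byte non-key frames
avg_bps  = key_bits + (fps - 1) * non_key_bits   # ~0.55 Mbps average, same order as 400 kbps
peak_bps = key_bits * fps                 # if the key frame is sent within one frame interval
print(peak_bps / avg_bps)                 # ~11, consistent with the >10 ratio observed
```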
Fig. 6. Video Quality, Network Throughput and Energy Efficiency:
(a) Video Quality and Frame Success Rate; (b) Throughput and Goodput;
(c) Network energy consumption. Video transmission on the network
should be tuned such that frame losses are minimal for good video
quality. As the frame error rate increases, video quality deteriorates,
and energy is over-spent.

Fig. 7. Surveillance infrastructure on Highway-A7 in Germany. The inset
shows the proposed topology for monitoring a 500 m stretch of the highway.

Fig. 8. Video quality with and without CA

In addition, we also measure the receiver throughput (received
bits per second), goodput (bit-rate of frames that are received
without errors) and energy consumption, as shown in
Figures 6(b) and 6(c). We observe that goodput and frame losses are
better indicators of video quality, and received throughput does
not correlate strongly with the quality. The network should
be controlled to reduce frame error rate for effective video
transport. Routers must avoid congestion by reducing their
transmission bit-rate if they observe frame drops.
CA also leads to energy savings (Figure 6(c)). Transmission
near the ideal aggregation rate saves energy by around 13%,
in addition to improving quality. The energy savings occur
because of the lower transmission rates and fewer retransmissions.
B. Large-scale Symmetric Topology
We simulate surveillance of a part of Highway-A7 in Germany (Figure 7). The highway already has surveillance camera
infrastructure where cameras are placed at half a kilometer
from each other. The current camera topology does not cover
the highway traffic completely.

We extrapolate the environment so that additional cameras are deployed to improve the coverage. We assume that the cameras
are placed every 50 m (say, on light poles). The inset figure
shows the proposed camera placement over a stretch of 500 m.
We evaluate a hierarchical topology to deliver the videos
from these cameras. We organize the cameras and routers
into different clusters (as shown in the inset of Figure 7).
Four consecutive cameras and the corresponding router are
considered as one cluster. For illustration, the cameras and
routers in a cluster are colored differently (red, blue and
green). Four camera streams are delivered to an aggregation
point, which forwards the videos to a relay. The relay may be the
video collection center, or in turn forward the videos towards
a centralized collection center using, say, a cellular network.
In addition to PSNR, we measure the video quality using
the Structural Similarity (SSIM) [20] metric. SSIM measures
quality of decoded video by comparing the structure of original
image with the decoded video instead of comparing pixel
values (as done in PSNR). SSIM is a value in the range
[-1, 1], with higher values corresponding to better quality.
SSIM evaluates the video quality as perceived by the human
eye, and is therefore more accurate than a pixel-level metric
such as PSNR [20].
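One common way to compute per-frame SSIM is via scikit-image; the snippet below is an illustration only (the library call and its data_range parameter are assumptions about that package, not part of this paper's tooling):

```python
from skimage.metrics import structural_similarity as ssim

def frame_ssim(reference, decoded):
    """Mean SSIM of a decoded frame against the reference (8-bit grayscale arrays;
    colour frames can be compared on their luma channel)."""
    return ssim(reference, decoded, data_range=255)
```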
Figure 8 shows the quality of received videos. The quality
of video in the standard video delivery case is significantly
lower than the videos aggregated at different rates; aggregation
is able to reduce the video size and adjust the encoding rates
to fit the transported video within the available capacity.
C. CA in Asymmetric Topologies
Thus far, we evaluated symmetric topologies where explicit
bandwidth allocation is not necessary. We now examine larger
asymmetric scenarios using a realistic border surveillance
scenario with long-range WiMAX links.
Border Surveillance Scenario: Border surveillance requires
monitoring large remote and inaccessible areas. Figure 4(a)
shows the Nogales area in Arizona where the US border is
monitored using an SCN [21]. Currently, there are seven fixed
surveillance and communication towers [21]. Each surveillance tower covers a portion of the border area that is already
identified, and transmits the real-time video to the OC. Such
real-time video is vital for instantly detecting and reacting
to illegal border breaches. The sparse deployment of cameras
creates large coverage holes that are unmonitored, thus prone
to illegal movement of people and smuggling of drugs.
In the scenario we consider, we propose deployment of
cameras near the border with the support of WiMAX Base
Stations (BS) to relay the video to the OC. WiMAX is
suitable for such scenarios because of its longer range (approx.
2-3 km). We choose simulation parameters to match the current
infrastructure [21]; each tower is 21 m high and transmits at
43 dBm power over a 7 GHz range. Using this configuration,
we are able to achieve a link capacity of around 30 Mbps at
500 m, and 10 Mbps at 2.5 km range. We assume that each BS
operates on a different frequency range.
We simulate an example scenario where four cameras (C1 to
C4) are randomly deployed in a strip of 8 × 1 km², as shown
in Figure 4(b). The inset of the figure shows a snapshot
of the sample surveillance video that overlooks the long border.
Using projective geometry and camera parameters, such as
focal length, we map the area seen by the camera to a part of
the overall Field-of-View (FoV) of the surveillance area [22].
We cut out the parts of the FoVs from the original video and scale
all the videos to the same dimension. We then encode the
videos at 2 Mbps, which are streamed in the simulator from
the camera nodes.
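The projective mapping from a camera view to its share of the overall FoV can be sketched with a planar homography; the snippet below is our illustration (the homography is assumed to come from the calibration step of [22]) and projects the image corners to the ground plane, returning the covered area, which could also serve as a_C in Cov-Fair.

```python
import cv2
import numpy as np

def ground_area(H, img_w, img_h):
    """Project the image corners onto the ground plane with homography H and
    return the covered area via the shoelace formula."""
    corners = np.float32([[0, 0], [img_w, 0], [img_w, img_h], [0, img_h]]).reshape(-1, 1, 2)
    ground = cv2.perspectiveTransform(corners, H).reshape(-1, 2)
    x, y = ground[:, 0], ground[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))
```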
Recall that in the HP-Fair scheme, each base-station (BS1-BS6 in Figure 4) aggregates at the same rate. Figures 9(a)
and 9(b) show the video quality when each HP aggregates
at different rates, and compare it with standard video delivery.
HP-Fair provides significant improvement. As the encoding
rate increases towards network capacity, the video quality of
HP-Fair increases. However, as the rate exceeds 2 Mbps,
we observed substantial rates of frame drops (as seen in the
HP-Fair Agg 3 Mbps curve in Figure 9(a)). These drops in
turn cause lower and fluctuating measured PSNR and SSIM
(Figure 9(b)).
The Cam-Fair, Cov-Fair and CC schemes also exhibited behavior
similar to HP-Fair, where the video quality increased as the
encoding rate increased, up to a threshold value. However, the
drop in video quality occurred at different thresholds. We
observed that Cam-Fair, Cov-Fair and CC were able to sustain
a rate of 3 Mbps video at the OC; note that HP-Fair
saturated at 2 Mbps.
Figure 9(c) compares all the schemes when the collective
consolidated video is delivered to the OC at 3 Mbps, which
is the bit-rate that provides the best video quality observed across
all schemes. We also include a naïve scheme (Single-Point
Aggregation or SP-Agg) where the first router where all videos
meet (BS4 in Figure 4) is chosen as the only aggregation point.
The CA schemes adjust their encoding rate at the aggregation
points accordingly. CC, Cam-Fair and Cov-Fair perform the
best across all the schemes.
VI. DISCUSSION AND LIMITATIONS
A complete Video Consolidation system for wide-area
surveillance has to address several challenges such as wireless
propagation effects, nuances of video encoding schemes, and
management protocols that reconfigure the SCN when the
network or camera topology changes. The proposed system
is a first attempt that contributes novel system design and
approaches for video delivery through consolidating the videos
and joint rate control. We highlight the current limitations of
the system to outline areas of future research.
a) Stitching at video-speed: We use a modified version
of the well-known software PanoTools [14] to stitch multiple
images. The base implementation requires 41 seconds to stitch
a single frame of 2 Mbps video on a Core i5 laptop. Our
optimized design (Section III-B), which avoids many of the
base computations if the FoV did not change, has a video
stitching rate of approximately 1 frame in 2 seconds. Thus,
stitching is about two orders of magnitude too slow on general
processors to enable supporting 10-30 frames per second in
real time. However, existing parallelization techniques can
reduce the stitching time to this range. For example, FPGA and
GPU implementations stitch frames at speeds of 200
frames per second [23], [24], even without our optimizations.
b) Camera topology and FoV changes: We assume that
the cameras change their FoV slowly and may be treated as
static within the time granularities we consider. This is usually
the case in large-scale surveillance systems where cameras are
configured to focus on specific areas. However, if the camera
FoV changes or new cameras are added, the aggregation and
bandwidth allocation computations have to be re-executed to
account for changes in the video sources. In the current model,
infrequent changes in topology can be handled by re-invoking
Mosaicing parameter generation.
c) Supporting Routing Changes: We assume that the
routes from the cameras to the OC do not change. It is
not difficult to generalize this assumption, by periodic or
route-change-driven recalculation of the bandwidth allocation.
Calculating best routes for video aggregation can substantially
impact performance; for example, it can allow streams with
overlap to meet sooner in the network to eliminate redundancy.
We wish to pursue this direction in the future.
Fig. 9. Video quality in HP-Fair and other schemes: (a) Frame Level
Quality; (b) Summary Quality; (c) Performance for an output of 3 Mbps
video at the OC. HP-Fair and the other schemes achieve good video
quality when the aggregation rate is 3 Mbps. Among the different
schemes, Cov-Fair and Cam-Fair perform the best.

VII. RELATED WORK
Video consolidation is a new approach for coverage and
bandwidth aware end-to-end video delivery in SCNs. VC
proposes novel in-network video consolidation to eliminate
redundancy in overlapping video streams. In addition, we believe
it is the first work to use collaborative video aware bandwidth
allocation across the network to optimize the consolidated
video quality given both video and network characteristics. In
this section, we discuss related work relative to specific aspects
of the proposed framework.
Video stitching: Related to video consolidation, stitching of
multiple videos has been investigated at the OC to provide
a combined view for human operators [25][27]. El-Saban
et al. propose real-time stitching of videos streamed from
different mobile phones on a centralized server [25]. Similarly,
Akella et al. [26] centrally stitch videos from UAVs deployed for a
battlefield monitoring application. Qin et al. [27]
consider online panorama generation of videos transmitted by
networked robotic cameras. VC differs from these works in
that mosaicing is carried out in the network to reduce the
bandwidth requirements. Alternative models for consolidation
corresponding to other video presentation requirements are
also possible within the proposed framework.
Camera placement works are related to consolidation in
that they attempt to reduce overlap between cameras. For
example, Yildiz et al. [28] present a camera placement scheme
to generate a panoramic video of a given area with a minimum
number of cameras. However, overlap is often desired to
improve coverage, and enhance redundancy, especially when
the cameras are controllable. Moreover, planning may not
be possible for organically growing or mobile SCNs. In
such situations, VC can complement planning to reduce the
overhead from resulting coverage overlap.
Multimedia Wireless Networking: The second component
of VC is coordinated bandwidth allocation. Several studies
optimize video transmission on a network using scheduling,
routing and video rate-control [10]-[12]. Video summarization
is used to further reduce the required bandwidth by presenting
a non-video summary, such as text and motion paths [29],
[30]. These works focus on reducing the bandwidth usage of a
single video stream and do not consider SCN scenarios where
multiple cameras are concurrently streaming.
Other studies consider transmission of multiple videos over
a network. Liew et al. [31] propose joint encoding of multiple

videos before transmitting them on the network. Video streaming of multiple streams in wireless mesh networks is considered by Navda et al. [32], where the focus is on improving
the throughput by identifying and using spatially separated
routes for video delivery. Our work is orthogonal: we propose
a network-aware delivery model in SCNs rather than routing
or encoding. Thus, we believe the above schemes can be used
in conjunction with VC to further improve performance.
Data aggregation in sensor networks: Aggregation in traditional sensor networks bears conceptual similarity to video
consolidation in terms of selecting aggregation points and
using appropriate aggregation functions. Specifically, to aggregate data in sensor networks, an aggregation tree is constructed
from sensors to the sink [8], in which all nodes except
the leaf nodes can be the data aggregation points. Common
aggregation functions used are MAX or AVERAGE [8], [33].
However, these approaches are not directly applicable to
SCNs because: (1) The existing aggregation functions, such as
AVERAGE and MAX, cannot be directly applied to the video
data; (2) Video data consumes significantly more bandwidth,
and thus demands judicious estimation of the right amount
of traffic on the network for efficient bandwidth management;
(3) Unlike traditional sensor nodes, SCNs use more powerful
network routers or base-stations. These high-end devices can
be exploited to perform more in-network processing.
VIII. CONCLUSIONS AND FUTURE WORK
This paper proposes video consolidation (VC), a bandwidth
and coverage aware video transport model for Smart Camera
Networks (SCNs) deployed for wide-area surveillance. More
precisely, VC consists of two complementary mechanisms:
(1) videos are consolidated within the network to eliminate
redundancy (areas with overlapping coverage); (2) coordinated
network bandwidth allocation combined with video rate adjustment to control congestion while improving the coverage
at the Operation Center (OC). The paper also contributes
a simulation technique incorporating video streams within a
network simulator, and three realistic simulation scenarios
based on real deployments of SCNs and public video feeds. We
discover that VC substantially reduces network traffic, while
improving video quality for all three scenarios, as well as in
a small camera testbed.

We investigated two approaches for consolidation. The first
approach, consolidation by aggregation (CA), detects and
blends areas of overlap in videos at intermediate routers. The
second approach, consolidation by coordination (CC), further
improves on CA by having the cameras avoid sending the
redundant data to begin with. For bandwidth allocation, we
investigated different policies for allocating the bandwidth
among different routers and showed that the coverage aware
policies provide the best performance. We implemented VC
(all variants) in a network simulator.
Our future research centers around developing systems and
protocols for organic and intelligent SCN systems. We plan
to develop automatic discovery of cameras, dynamic consolidation and consolidation-aware routing to allow distributed
VC protocols. In the video processing domain, we want to
explore fast and efficient video processing using a non-bursty
video codec, such as the Scalable Video Codec [34].
REFERENCES
[1] H. Aghajan and A. Cavallaro, Multi-Camera Networks: Principles and
Applications. Academic Press, 2009.
[2] Motorola, Wireless Video Surveillance: The New Picture of Quality
and Economy, 2008. [Online]. Available: http://www.motorola.
com/web/Business/Products/Wireless\%20Networks/Wireless\
%20Broadband\%20Networks/Point\%20to\%20Multi-point\
%20Networks/Canopy\%20Products/Canopy/ Documents/static\
%20files/Video\%20Surveillance\%20Economics.pdf
[3] H. Collins, Video Camera Networks Link Real-Time Partners in
Crime-Solving, Feb 2012. [Online]. Available: http://www.govtech.com/public-safety/
Video-Camera-Networks-Link-Real-Time-Partners-in-Crime-Solving.html
[4] K. Abas, C. Porto, and K. Obraczka, Wireless Smart Camera Networks
for the Surveillance of Public Spaces, Computer, vol. 47, no. 5, pp. 37
44, May 2014.
[5] N. G. L. Vigne, S. S. Lowry, J. A. Markman, and A. M. Dwyer,
Evaluating the Use of Public Surveillance Cameras for Crime Control
and Prevention, Sep 2011.
[6] L. Krishnamachari, D. Estrin, and S. Wicker, The impact of data
aggregation in wireless sensor networks, in Distributed Computing
Systems Workshops, 2002. Proceedings. 22nd International Conference
on, 2002, pp. 575 578.
[7] E. Fasolo, M. Rossi, J. Widmer, and M. Zorzi, In-network aggregation
techniques for wireless sensor networks: A survey, Wireless Communications, IEEE, vol. 14, no. 2, pp. 70 87, april 2007.
[8] S. Madden, M. J. Franklin, J. M. Hellerstein, and W. Hong, TAG: a
Tiny AGgregation service for ad-hoc sensor networks, SIGOPS Oper.
Syst. Rev., vol. 36, no. SI, pp. 131146, Dec. 2002.
[9] C. Intanagonwiwat, R. Govindan, D. Estrin, J. Heidemann, and
F. Silva, Directed diffusion for wireless sensor networking, Networking, IEEE/ACM Transactions on, vol. 11, no. 1, pp. 2 16, Feb 2003.
[10] E. Setton, T. Yoo, X. Zhu, A. Goldsmith, and B. Girod, Cross-layer
design of ad hoc networks for real-time video streaming, Wireless
Communications, IEEE, vol. 12, no. 4, pp. 59 65, aug. 2005.
[11] S. Milani and G. Calvagno, A low-complexity cross-layer optimization
algorithm for video communication over wireless networks, Trans.
Multi., vol. 11, no. 5, pp. 810821, 2009.
[12] M. van der Schaar and D. S. Turaga, Cross-Layer Packetization
and Retransmission Strategies for Delay-Sensitive Wireless Multimedia
Transmission, Multimedia, IEEE Transactions on, vol. 9, no. 1, pp. 185
197, jan. 2007.
[13] S. Winkler and P. Mohandas, The Evolution of Video Quality Measurement: From PSNR to Hybrid Metrics, Broadcasting, IEEE Transactions
on, vol. 54, no. 3, pp. 660 668, sept. 2008.

[14] H. Dersch et al., Panorama tools, 1998-2008. [Online]. Available:


http://panotools.sourceforge.net/
[15] D. Ding, F. Metze, S. Rawat, P. F. Schulam, S. Burger, E. Younessian,
L. Bao, M. G. Christel, and A. Hauptmann, Beyond audio and video
retrieval: towards multimedia summarization, in Proceedings of the 2nd
ACM International Conference on Multimedia Retrieval, ser. ICMR 12.
New York, NY, USA: ACM, 2012, pp. 2:12:8.
[16] Cisco, Taking 3D to the Next Level, Feb 2011.
[Online]. Available: http://newsroom.cisco.com/feature-content?type=
webcontent&articleId=5910450
[17] Qualnet network simulator, http://www.scalable-networks.com/.
[18] J. Klaue, B. Rathke, and A. Wolisz, EvalVid: A Framework for
Video Transmission and Quality Evaluation, in Computer Performance
Evaluation. Modelling Techniques and Tools, ser. Lecture Notes in
Computer Science, P. Kemper and W. Sanders, Eds. Springer Berlin /
Heidelberg, 2003, vol. 2794, pp. 255272.
[19] Ffmpeg and libavcodec. [Online]. Available: http://ffmpeg.org/
[20] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, Image quality assessment: from error visibility to structural similarity, Image Processing,
IEEE Transactions on, vol. 13, no. 4, pp. 600 612, april 2004.
[21] Department of Homeland Security, Integrated Fixed Towers.
[Online]. Available: https://www.fbo.gov/?s=opportunity&mode=form&
id=5d1e009b6009b6a503ddcb5e61af8258&tab=core& cview=1
[22] R. Radke, Multiview Geometry for Camera Networks, Multi-camera
networks: principles and applications, 2009.
[23] L. Chang and J. Hernández-Palancar, A Hardware Architecture for SIFT
Candidate Keypoints Detection, in Progress in Pattern Recognition,
Image Analysis, Computer Vision, and Applications, ser. Lecture Notes
in Computer Science, E. Bayro-Corrochano and J.-O. Eklundh, Eds.,
vol. 5856. Springer Berlin / Heidelberg, 2009, pp. 95102.
[24] C. Zach, D. Gallup, and J.-M. Frahm, Fast gain-adaptive KLT tracking
on the GPU, in Computer Vision and Pattern Recognition Workshops,
2008. CVPRW 08. IEEE Computer Society Conference on, June 2008.
[25] M. A. El-Saban, M. Refaat, A. Kaheel, and A. Abdul-Hamid, Stitching
videos streamed by mobile phones in real-time, in Proceedings of the
17th ACM international conference on Multimedia, ser. MM 09. New
York, NY, USA: ACM, 2009, pp. 10091010.
[26] P. Akella and A. Ganz, Panorama - pervasive networking architecture
for multimedia transmission in battlefield environments, in Military
Communications Conference, 2004. MILCOM 2004. 2004 IEEE, vol. 2,
oct.-3 nov. 2004, pp. 1107 1113 Vol. 2.
[27] N. Qin and D. Song, On-demand sharing of a high-resolution panorama
video from networked robotic cameras, in Intelligent Robots and
Systems, 2007. IROS 2007. IEEE/RSJ International Conference on, 29
2007-nov. 2 2007, pp. 3113 3118.
[28] E. Yildiz, K. Akkaya, E. Sisikoglu, M. Sir, and I. Guneydas, Camera
Deployment for Video Panorama Generation in Wireless Visual Sensor
Networks, in Multimedia (ISM), 2011 IEEE International Symposium
on, dec. 2011, pp. 595 600.
[29] W.-C. Feng, E. Kaiser, W. C. Feng, and M. L. Baillif, Panoptes:
scalable low-power video sensor networking technologies, ACM Trans.
Multimedia Comp. Comm. Appl., vol. 1, no. 2, pp. 151167, 2005.
[30] W. Hu, T. Tan, L. Wang, and S. Maybank, A survey on visual surveillance of object motion and behaviors, Systems, Man, and Cybernetics,
Part C: Applications and Reviews, IEEE Transactions on, vol. 34, no. 3,
pp. 334 352, Aug 2004.
[31] S. Liew and C.-Y. Tse, Video aggregation: adapting video traffic for
transport over broadband networks by integrating data compression
and statistical multiplexing, Selected Areas in Communications, IEEE
Journal on, vol. 14, no. 6, pp. 1123 1137, aug 1996.
[32] V. Navda, A. Kashyap, S. Ganguly, and R. Izmailov, Real-time Video
Stream Aggregation in Wireless Mesh Network, in Personal, Indoor
and Mobile Radio Communications, 2006 IEEE 17th International
Symposium on, sept. 2006, pp. 1 7.
[33] S. Lindsey, C. Raghavendra, and K. Sivalingam, Data gathering algorithms in sensor networks using energy metrics, Parallel and Distributed Systems, IEEE Transactions on, vol. 13, no. 9, Sep 2002.
[34] H. Schwarz, D. Marpe, and T. Wiegand, Overview of the Scalable
Video Coding Extension of the H.264/AVC Standard, Circuits and
Systems for Video Technology, IEEE Transactions on, vol. 17, no. 9,
pp. 1103 1120, Sept 2007.
