
31/07/2018

Caching of Geographic Models Applicable to Mobile Devices
State of the Art


Table of Contents
1 Introduction
2 Caching Strategies: Architectures and Algorithms
3 Caching of Geographic Models
3.1 Developments in Web Environments with Cloud Technology
3.2 Developments in Mobile Environments with Cloud Technology
4 Final Remarks and Next Steps
5 References


1 Introduction

This state of the art for research line 2, Caching of Geographic Models, aims to identify, through a literature review, the main theoretical and empirical developments for mobile devices that are applicable to tourism.

Databases of geographic information are managed by GIS (Geographic Information Systems), which are characterized by very large volumes of data. This raises important issues for use on mobile platforms, since such systems require large amounts of memory, high processing capacity and, as a consequence, heavy use of the mobile device's battery. Caching issues arise in scenarios of limited connectivity.

The target of the development intended here is a situation in which the user can carry out their tourist "adventures" autonomously and continuously, including periods in which the smartphone may have no connectivity at all. It is therefore essential to find ways of keeping the service running (preserving at least the application's main features). Given the limitations of mobile devices (in terms of processing, memory and connectivity), solutions designed for a client/server paradigm cannot be brought directly to this type of device. Nevertheless, developments in web and desktop environments must be considered, since they provide building blocks for the development intended here.

Although there are important contributions in each of these areas, and GIS indexing and caching technologies can be combined in a single architecture (e.g. a GIS system can be coupled to a web cache system), there is no single, specialized solution designed to run on mobile devices; the existing solutions apply only to web environments where connectivity is guaranteed.

This state of the art focused on the investigation of requirements, mathematical formalizations, and the analysis and proposal of algorithms for handling georeferenced content in terms of caching. It also addressed the theoretical and mathematical study of deploying the surveyed solutions on mobile devices; based on this state of the art, those solutions will have to be adapted and/or new approaches proposed in order to meet the requirements and needs of Xplore Market.


The state of the art on a given subject concerns the technical and theoretical production accumulated on a topic up to a certain period.

Defined as bibliographic in nature, these studies seem to share the challenge of mapping and discussing a certain academic production in different fields of knowledge, trying to answer which aspects and dimensions have been highlighted and privileged in different times and places, and in what ways and under what conditions certain master's dissertations, doctoral theses, journal publications and papers in conference and seminar proceedings have been produced. (Norma Sandra Ferreira de Almeida, 2002, p. 1).

Herrera (2013) points out that a state of the art shows how the main concepts and methods have been treated in existing research, helping to frame the motivations and limitations of the development currently intended. By investigating how authors have treated concepts, algorithms and methods, it updates and informs the present development. This type of investigation includes an intentional and systematic bibliographic survey of a subject.

They are also recognized for applying a methodology that inventories and describes the academic and scientific production on the topic under investigation, in light of categories and facets that are characterized as such in each work and in the set of works, under which the phenomenon comes to be analyzed. (Norma Sandra Ferreira de Almeida, 2002, p. 1).

A literature review requires a strategy for searching, systematizing and analyzing the works (Inoue, 2015). The survey of the material used here was guided by the study of caching of geographic models for mobile applications. The search was carried out in the Scopus database and in Google Scholar using open keywords, given how recent the subject is. At this stage, 11 scientific works were found. A content analysis was then performed on the abstracts of the texts found. To systematize this content, a coding strategy was used, that is, the application of labels describing the compiled information.


Coding is analysis. To review a set of field notes, transcribed or synthesized, and to dissect them meaningfully, while keeping the relations between the parts intact, is the stuff of analysis. This part of analysis involves how you differentiate and combine the data you have retrieved and the reflections you make about this information. (Miles & Huberman, 1994, p. 56).

The first round of coding was descriptive and made it possible to name the issues addressed with respect to caching strategies and applications in tourism. Six texts were filtered. From this broader universe, the categories needed to discuss the development proposed here were identified. Finally, five texts were selected to compose this literature review and were read in full. They are listed in Table 1, which makes clear how the proposed categories and the keywords found in the texts complement each other.

Table 1. Analysis categories, topics and keywords of the selected bibliography

Coding: Caching, architectures and algorithms
Title: Proactive Retention Aware Caching
Keywords: cache storage, cloud computing, proactive retention aware caching, among others

Coding: Caching, architectures and algorithms
Title: Optimum Caching versus LRU and LFU: Comparison and Combined Limited Look-Ahead Strategies
Keywords: web cache strategies, optimum caching, Belady's algorithm, hit rate, simulation, Zipf distributed requests, least recently used (LRU), least frequently used (LFU)

Coding: Caching, GIS, cloud, web
Title: VegaCache: Efficient and progressive spatio-temporal data caching scheme for online geospatial applications
Keywords: spatio-temporal data management; spatio-temporal database; distributed cache; WebGIS; geospatial application; spatial cloud computing

Coding: Caching, GIS, cloud, web
Title: A replacement strategy for a distributed caching system based on the spatiotemporal access pattern of geospatial data
Keywords: caching; replacement; spatiotemporal; LRU; networked GIS

Coding: Caching, GIS, cloud, mobile
Title: Reducing the cloud cost of mobile reverse-geocoding
Keywords: smartphone; cloud; reverse-geocoding; caching; GIS

The state of the art intentionally included different types of scientific sources in order to cover as many developments as possible, following Souza et al. (2017).

As this is a recent topic in the literature, little explored in the field of tourism, it was decided not to restrict the results, using the option: every document type, "Articles" or "Articles in press", "Journals", "Book or Book chapter",

"Article or conference paper", "Conference Review", "Editorial", "Business Article", "Short Survey" and "Erratum". (Souza, Varum & Eusébio, 2017).

Although the surveyed texts appear both in scientific journals and in conference proceedings, the selected ones all come from conferences, which reflects the search for the most recent developments. Table 2 presents the sources.

Table 2. Sources of the selected bibliography

Proceedings of the Fifth International Workshop on Mobile Cloud Computing & Services
21st International Conference on Geoinformatics
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences
IEEE Conference on Computer Communications
16th International Symposium on Modeling and Optimization in Mobile, Ad Hoc, and Wireless Networks (WiOpt)

The topic has been gaining importance in recent years, mainly because it involves the evolution of mobile devices; the 2003 work that already dealt with this issue stands out. At the same time, since this is a technological area, developments happen all the time. Care was taken to ensure that the selected works cover a timeline consistent with this development and evolution. It is also worth noting that, among the works analyzed from 2017, no relevant advance was identified. Table 3 also identifies the origin of the works, which are concentrated in North America and China.


Table 3. Analysis categories over time and space

Coding: Caching strategies: architectures and algorithms | Year: 2017 | Affiliation: United States
Coding: Caching strategies: architectures and algorithms | Year: 2018 | Affiliation: Germany, Finland and Greece
Coding: Caching, GIS, cloud, web | Year: 2013 | Affiliation: China
Coding: Caching, GIS, cloud, web | Year: 2014 | Affiliation: China
Coding: Caching, GIS, cloud, mobile | Year: 2014 | Affiliation: United States

With this in mind, the systematized bibliographic survey is presented below, organized in the categories defined in the bibliographic analysis:

- Caching strategies for geographic models: architectures and algorithms
- Caching of geographic models: developments in web and mobile environments

Each item unfolds into subcategories that detail the information collected. Considerations on caching issues related to geographic models are made in the Final Remarks and Next Steps section at the end of this document.

Since the information is primarily technical, including many mathematical derivations of algorithms, the explanations from the original works are kept verbatim, so that the analyses are not harmed by the imprecision common to translation. To preserve the integrity of the equations, screenshots were taken from the original texts whenever necessary.


2 Caching Strategies: Architectures and Algorithms

The work of Hasslinger, Heikkinen, Ntougias, Hasslinger, & Hohlfeld (2018) evaluates caching strategies, comparing the usual LRU and LFU with a strategy based on Belady's algorithm. By providing fast access to browsing data, caching must optimize the transport, quality and storage of web content. Such an evaluation advances the theoretical understanding of caching.

The comparison of the three caching strategies in a web environment addresses the hit rate for Zipf-distributed requests and trace-based evaluations on F-Secure platforms (Hasslinger et al., 2018). The authors stress that studies on the topic rarely address optimum caching, or do not treat it in the web environment. Their contribution also stands out for including the strategy based on Belady's algorithm. Figure 1, reproducing Table 1 of the paper, presents the main properties of each method.

Figure 1. Table 1 of the paper: Main performance properties of basic caching methods

Source: Hasslinger et al., 2018, p. 2.

Box 1 gathers the information needed for this comparison. It presents the methods discussed for deciding on the cache content, highlighting the LRU and LFU principles, which select the most recently or most frequently requested content, respectively, and Belady's algorithm. The latter targets the hit rate, a key aspect of caching, and decides on the cached content based on the time until the next request for each object. Figure 2 illustrates the execution of the algorithm.


The least recently used (LRU) principle, which keeps the M most recently requested items in a cache of size M, is widely used because of its simple implementation and updates [4][20]. The currently requested object is always put on top of a stack being organized as a double chained list, whereas the bottom object of the cache is evicted in case of a cache miss [12]. This leads to low constant effort for updating the stack per request by manipulating a few pointers belonging to the requested and the bottom object. On the other hand, the achievable LRU hit rate is low [12][17][18][21] because relevant information for predicting the next requests is not fully exploited.

The least frequently used (LFU) principle puts the objects with highest request count into the cache and therefore maintains statistics of past requests per object. If we assume an independent reference model (IRM) with request probabilities p1 ≥ p2 ≥ … ≥ pN, LFU will collect the most popular objects in the cache in a long term steady state behaviour. For cache size M, LFU approaches the hit rate hLFU = p1 + p2 + … + pM, which is the maximum IRM hit rate for any caching strategy based on past requests.

LFU can be implemented as a sorted list of N items according to the request counts. Updates of an LFU cache can still be done at constant O(1) effort per request, where only the count of the requested object is incremented [22]. Variants with limited request count are preferable to pure LFU in order to adapt to changing popularity of web content. Sliding window LFU is often used as a variant, which restricts the count statistics to the W most recent requests. Sliding window LFU can still be implemented at constant effort per request. Score gated LRU strategies [12] provide a flexible alternative with constant update effort, which include window LFU and other schemes [17].

A completely different approach considers the cache hit rate with full information about future requests. Then Belady's algorithm is known to maximize the hit rate [2], which decides about the cache content based on the time until the next request to each object. In case of a cache miss, the requested object is put into the cache, if its next request comes earlier than that of a cached object, replacing the object with longest time until its next request. It is a greedy algorithm, always enabling the next possible hit in the future request sequence.

Therefore the implementation maintains a sorted cache list according to the time index of the next requests. We assume the future requests rt being available for 1 ≤ t ≤ Tmax, provided by a trace that has been monitored on a web platform or by a random generator for simulating requests, e.g., for independent IRM requests.

Together with the object ok(t) being addressed at request rt we store the index Tok+(t) of the next request to the same object ok in a second array, also for 1 ≤ t ≤ Tmax, where Tok+(t) = Tmax + 1 if ok isn't requested anymore. Figure 1 illustrates the data sets being required for efficient execution of Belady's algorithm.

Tok+(t) is determined by a single complete scan through the future request sequence. During the scan, the time Tok–(t) of the currently most recent request to each object ok is stored (Tok– = 0 if there was no previous request to ok). Then Tok+(t) is fixed for a new request to ok in the scan backwards at previous time index Tok–, and Tok– is then updated to t. Figure 1 of the paper gives a worked example of how these next-request indices are assigned along the request sequence. (p. 2)

Figure 2. Data set for efficient support of Belady's algorithm

Source: Hasslinger et al., 2018, p. 2.

While the precomputation of Tok+(t ) is executed at constant effort per request, Belady’s
algorithm finally requires O(log(M)) complexity per request for inserting objects into the sorted list
of cached objects, e.g. via heapsort [8][17].

Box 1. LRU, LFU and Belady's algorithm

Source: Hasslinger et al., 2018, pp. 2-3.
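The eviction rule just described can be made concrete with a short sketch. The Python code below is an illustrative reading of Belady's algorithm, assuming the whole future request sequence is known; for clarity it uses a linear scan for the farthest next request instead of the sorted list or heap mentioned in the excerpt, and the function name and data layout are editorial, not taken from Hasslinger et al.

def belady_hits(requests, cache_size):
    """Simulate Belady's optimal replacement for a fully known request sequence."""
    T = len(requests)
    NEVER = T + 1                          # sentinel: object never requested again (Tmax + 1)

    # Single backward scan precomputes, for every position t, the index of the
    # next request to the same object (the second array of the excerpt).
    next_use = [NEVER] * T
    last_seen = {}
    for t in range(T - 1, -1, -1):
        next_use[t] = last_seen.get(requests[t], NEVER)
        last_seen[requests[t]] = t

    cache = {}                             # object id -> index of its next request
    hits = 0
    for t, obj in enumerate(requests):
        if obj in cache:
            hits += 1
            cache[obj] = next_use[t]       # refresh the next-request time on a hit
            continue
        # cache miss
        if len(cache) < cache_size:
            cache[obj] = next_use[t]
        else:
            victim = max(cache, key=cache.get)        # farthest next request
            if next_use[t] < cache[victim]:           # insert only if requested again sooner
                del cache[victim]
                cache[obj] = next_use[t]
    return hits

Run over a trace, belady_hits gives the clairvoyant upper bound against which the LRU and LFU hit rates discussed below can be compared.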

To compare the hit rates of LRU, LFU and the proposed optimum method, the authors adopt the Zipf request pattern, which the literature presents as the representative request distribution for web platforms.

Zipf's law assigns decreasing request probabilities z(r) corresponding to the objects' popularity ranks r ∈ {1, 2, …, N}:

Figure 3. Screenshot of the Zipf law formula from the original paper

Source: Hasslinger et al., 2018, p. 3.

with shape parameter β and a normalization constant. The measurement studies have experienced Zipf distributed requests on a number of different web platforms with estimations of the shape parameter β in the range –0.4 ≥ β ≥ –1. (Hasslinger et al., 2018, p. 3).
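A request stream with this popularity profile can be generated in a few lines for simulation purposes. The snippet below is an illustrative sketch, assuming the Zipf form z(r) proportional to r raised to the power β with β negative, as in the excerpt; names and parameter values are arbitrary.

import numpy as np

def zipf_request_stream(n_objects, beta, n_requests, seed=0):
    """Draw IRM requests whose popularity follows z(r) ~ r**beta (beta negative)."""
    rng = np.random.default_rng(seed)
    ranks = np.arange(1, n_objects + 1)
    probs = ranks.astype(float) ** beta       # unnormalized Zipf weights per rank
    probs /= probs.sum()                      # normalization constant applied here
    return rng.choice(n_objects, size=n_requests, p=probs)   # object ids 0..N-1 by rank

# e.g. one million independent requests over a catalogue of 10 000 objects, beta = -0.75
requests = zipf_request_stream(10_000, -0.75, 1_000_000)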


The authors compare the hit rates of the three cache strategies. For this evaluation, they consider traces collected at a caching proxy of the F-Secure platform. Their remarks on this are reproduced in their own words.

We consider trace-based evaluation of requests from F-Secure’s


platform [5] as well as analytic and simulation results of IRM requests in the
next section. Trace-based results are shown for an example of about 26 million
requests to 2.09 million keys over a one week period from Oct. 17-23, 2016.
Again, a Zipf-like distribution with shape parameter β ≈ –0.75 is observed for the
top-N most popular objects in the trace, where about 50% of the requests are
addressing the top-10000 keys.
The traces were collected from a caching proxy, through which F-
Secure's applications queried a backend database for information on
application files or URLs they encountered. After stripping any personally
identifiable information from the data, a hash was calculated from the object
data, representing a query key. No information that could identify from which
individual client a particular query came from was kept. The traces contain
only a list of hash strings with timestamps. The pool of clients connecting to
the caching proxy consists almost entirely of Android mobile devices, querying
hashes of new and updated Android application files. (Hasslinger et al., 2018,
p. 3).

Box 2 discusses the comparison of the hit rates.

We compare cache hit rates for optimum caching with LRU and LFU strategies. Each
evaluation includes the complete trace starting from an empty cache. Pure LFU includes the count
statistics over the whole one week trace, which is subject to an inflexibility regarding dynamics on
shorter time scales. Therefore LFU variants over a limited time frame (sliding window LFU) are
more efficient. The time scale ranging from an hour up to a day is experienced to be most relevant
for the dynamics in web request pattern [13][23]. Therefore we add an evaluation for LFU with daily
count statistics, i.e. with counts being reset at the start of each day.

Figure 2 shows how the hit rate is developing with the cache size for the alternative
strategies. On the whole, caching is efficient, such that a cache with size M/N = 1‰ of the catalogue
size N achieves more than 20% hit rate in all cases. However, there is a large gap of up to 15% hit
rate visible between LRU and LFU and a gap of about half the size between LFU and optimum
caching. LFU with daily count statistics improves the LFU hit rate with count over the whole week
by only 1-2%. We also studied sliding window LFU [12][17] with count statistics over the K most
recent requests. Then a maximum hit rate was obtained for K ≈ 50 000, roughly corresponding to 20 minutes of the trace, but the improvement over LFU with daily request count turned out to be
negligible.


The results confirm that LFU based on daily statistics is close to optimum for non-predictive
caching strategies, but still leaves a considerable gap open towards clairvoyant optimum caching.

Figure 4. Cache performance for a one week web request trace

Source: Hasslinger et al., 2018, p. 3.

Box 2. Comparison of the hit rates

Source: Hasslinger et al., 2018, p. 3.

The authors then evaluate the cache hit rate for Zipf-distributed requests via simulations. Box 3 presents the results.

In the sequel, we evaluate the cache hit rate for independent Zipf distributed requests via simulation. Assuming IRM Zipf request pattern, a caching system is specified by 3 parameters: the catalogue size N, the cache size M and the shape parameter β of the Zipf distribution. The results give an overview of the hit rate performance in the relevant range of those parameters. Each simulation is running over at least 10⁸ requests, where a cache filling phase at the start is ignored. The precision is validated via 2nd order statistics [12], confirming standard deviations for the simulated hit rates in the order 10⁻⁴. The Markov model of optimum caching is used for an independent check of Belady's algorithm [9].

Figure 3 - Figure 6 show results for β = 0, –0.5, –0.75, –1. The case β = 0 refers to a uniform request distribution, whereas the three other cases are distributed over the range, which is experienced as most relevant in measurements of web request pattern (–0.4 ≥ β ≥ –1). The performance of optimum caching represents the main new insight of this study, which has not been considered for web caching examples or Zipf request pattern in basic work in literature [2][8][9][15][16].


The LRU and LFU hit rates for uniform request shown in Figure 3 are equal to the ratio M/N of the cache size to the catalogue and mark a common worst case for both strategies. However, predictive optimum caching achieves much higher hit rates hOpt, which mainly depend on the fraction M/N, with almost identical curves for N = 10³, …, 10⁶ within the bounds

Figure 5. Screenshot from the original paper

Source: Hasslinger et al., 2018, p. 3.

Although uniform requests are a worst case for optimum caching as well, the gain over LRU and LFU
is extreme, up to 37%.

Figure 6. Screenshot from the original paper

Source: Hasslinger et al., 2018, p. 4.

Figure 4 shows results for β = –0.5 which corresponds to a moderate request focus on popular objects within the relevant range of web request pattern. The hit rates still mainly depend on the fraction M/N of objects in the cache, leading to bundles of four closely adjacent curves for N = 10³, …, 10⁶ with largest deviation for optimum caching in case N = 1000. Compared to uniform requests, the hit rate gap of LRU and LFU towards optimum caching becomes smaller but still opens up to

30% for LRU and 20% for LFU.

For Zipf distributions with more skewness towards popular objects for β → –1, caching efficiency depends on both parameters M and N, rather than mainly on M/N as for 0 > β > –0.5. Thus Figure 5 (β = –0.75) and Figure 6 (β = –1) show two hit rate curves for N = 10⁴ and N = 10⁶ for each of the three considered strategies. In general, cache hit rates essentially increase for all caching strategies with higher request focus on popular objects. However, LRU hit rates remain far below optimum caching in all cases of small and moderate cache sizes.

On the other hand, LFU can lower the hit rate gap towards optimum caching for β → –1, since count statistics on past requests are more useful for higher concentration on few popular items. The results suggest hOpt – hLFU ≈ 0.25 + 0.2·β as a rough estimate of the gap for –0.5 ≥ β ≥ –1. Extensions for β < –1 are possible, but this range is not relevant for web caching.

Box 3. Evaluation of the cache hit rate for Zipf requests via simulations

Source: Hasslinger et al., 2018, pp. 3-4.
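The IRM simulation summarized in Box 3 can be approximated with simple reference implementations of the two non-predictive policies. The sketch below is illustrative, not the authors' simulator: it measures LRU and pure LFU hit rates on a synthetic stream such as the one produced by zipf_request_stream above, so they can be set against the Belady bound from the earlier sketch.

from collections import OrderedDict, Counter

def lru_hit_rate(requests, cache_size):
    """LRU: the requested object goes to the top; the bottom object is evicted on a miss."""
    cache = OrderedDict()
    hits = 0
    for obj in requests:
        if obj in cache:
            hits += 1
            cache.move_to_end(obj)           # put the requested object on top of the stack
        else:
            if len(cache) >= cache_size:
                cache.popitem(last=False)    # evict the bottom (least recently used) object
            cache[obj] = True
    return hits / len(requests)

def lfu_hit_rate(requests, cache_size):
    """Pure LFU over the whole trace: keep the objects with the highest request counts."""
    counts = Counter()
    cache = set()
    hits = 0
    for obj in requests:
        if obj in cache:
            hits += 1
        counts[obj] += 1                      # count statistics over all past requests
        if obj not in cache:
            if len(cache) < cache_size:
                cache.add(obj)
            else:
                coldest = min(cache, key=lambda o: counts[o])
                if counts[obj] > counts[coldest]:   # replace the least frequently used object
                    cache.remove(coldest)
                    cache.add(obj)
    return hits / len(requests)

With, for example, zipf_request_stream(10_000, -0.75, 10**6) and a cache of 100 objects, this should reproduce the qualitative ordering reported by the authors: LFU above LRU, both well below the optimum caching bound.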

Having compared the methods, Hasslinger et al. (2018) then explore the proposed optimum caching in situations of limited look-ahead. According to The Free Dictionary encyclopedia, look-ahead is "the common operation of a disk cache, which is to read more sectors on the disk than are called for by the software. If the next disk read uses data that follows the last read, then the data are already in memory". The authors use video streaming to evaluate cache performance.

… We focus on scenarios, with non-negligible delay from the decision to evict


data from the cache until the data is overwritten by uploads of new data. Uploads of
video streams usually can last for several minutes. During this time, a decision to evict
a video can be withdrawn when new requests are encountered where the portion of data
is kept in the cache that has not yet been evicted. In this way, limited look-ahead
scenarios can improve the cache performance towards the optimum caching bound.
Other approaches have been proposed to improve cache efficiency using awareness of
optimum caching and Belady’s algorithm [1][6][8].
The derived performance results for a single caching system represent the
essential component in extended studies on distributed and/or hierarchical caching
systems in content delivery via clouds, CDN and ICN architectures [4][10][19][24].
(Hasslinger et al., 2018, p. 1).


The authors address future requests by considering look-ahead options for caching under delayed uploads, combined caching schemes that exploit a limited look-ahead, the hit rate evaluation under limited look-ahead and, finally, an analysis of the proposed optimum also considering look-ahead. Box 4 discusses these situations and their respective mathematical formalizations.

A. Look-ahead options for caching due to delayed uploads


The previous results show a significant advantage of optimum caching over other methods
owing to knowledge of future requests. Often some partial knowledge about upcoming requests is
available and used for prediction and prefetching [1][6]. Within this scope, we evaluate how far a
limited look-ahead can realize part of the optimum caching gain when a fixed number L of the next
upcoming requests are known, which corresponds to a more or less fixed time window.

Such a limited look-ahead can be realized, when there is a delay between a request and the
corresponding upload of content to the cache. During the delay time some “future” requests become
visible which can still be regarded in a replacement decision until the update is actually processed.
Data uploads are initiated only in case of cache misses, when external objects is requested that have
to be retrieved from a server or higher layer cache. Therefore we can always assume a short transfer
delay for web cache updates between data centers, roughly in a range of 0.1s - 1s for small data units.

A considerable number of requests is served by web caches already in short time. Wikipedia
as a popular web site reported peaks of over 50 000 requests per second in 2014 being handled by a
few caching servers [24]. Akamai’s CDN had peak loads beyond 30 million requests per second in
2013 [7], which are distributed over a large web caching hierarchy. Thus caches for popular web
content usually have information about several thousand upcoming requests available until data
is retrieved for an update, even in short transfer delay scenarios.

Moreover, we consider video streaming as a scenario with much longer upload times, when
uploads are synchronized with the video stream to the user. Then data is continuously transferred in
small chunks during the video viewing time, which can last in a range over seconds and minutes up
to an hour [14]. While data chunks of a video in the cache are being overwritten, new requests to the video can be regarded in a look-ahead scenario to stop further replacement. When the replacement
of a one minute video is stopped, e.g. after 15 s, then 75% of the data can still be kept to serve a new
request.

B. Combined caching scheme to exploit limited look-ahead


In order to utilize a limited look-ahead L for improved hit rates, we suggest combining
Belady's algorithm for those objects with known next request time (< L) with a secondary usual
caching scheme (LFU, LRU etc.) for all other objects.

At a request rT, all objects in the cache with a next request before rT+L+1, i.e. within the look-
ahead window are sorted according to their next request time as shown for Belady’s algorithm in
Figure 1. The next request time of other cached objects is unknown and invalidated, e.g. as 0. After
request rT, the sorted list is updated by reinserting the requested object, if its next request comes before
rT+L+1. The object requested at rT+L+1 is added at the end of the list, if the latter is in the cache with a
previously invalid next request time.


If the sorted cache list of object with valid next request times exceeds the cache size M, then
the object with the farthest next request time is evicted. Otherwise, if the list is shorter than the cache
size, then all objects in the list stay in the cache, whereas the secondary caching strategy (LRU, LFU
or another) is applied to select and update the content in the remaining part of the cache.

C. Hit rate evaluation for the limited look-ahead scheme


We simulate the gain obtainable by a limited look-ahead for the next L requests using a
combined strategy of optimum caching with LRU. We show two evaluations extending the previous
results for the one week trace of Figure 2 and for the Zipf distributed requests with β = –0.5 of Figure
4.

The performance of limited look-ahead is presented for both cases in Figure 7 and Figure 8, respectively. They include the curves for the optimum caching hit rate and for LRU known from Figure 2 and Figure 4 as the maximum and the minimum. Moreover, four curves are added in between, corresponding to evaluations of four look-ahead variants with different limit L. In each case we observe a common effect that caching performance of limited look-ahead

- starts along the optimum caching curve up to cache size M*,
- then only slightly improves with increasing cache size beyond M*, while sliding from optimum caching performance towards the LRU curve,
- and finally approaches the LRU curve for large cache size.

Figure 7. Screenshot from the original paper

Source: Hasslinger et al., 2018, p. 5.


When optimum caching performance is achieved, the cache is almost filled with objects
whose next request comes in the limited look-ahead region before rT+L+1. Otherwise, many objects
are encountered, whose next request comes beyond the limit. We obtain similar results for
combinations with LFU or another strategy instead of LRU, where the LRU hit rate is replaced by
their curve as the lower bound.

D. Analytical result on optimum limited look-ahead caching


We can derive an analytic result for the cache size M* up to which optimum caching is fully
exploited with limited look-ahead for IRM requests. M* marks the points in Figure 8, at which the
limited look-ahead curves start to deviate from optimum caching. Under IRM, we have geometrically
distributed intervals Ij between requests to the same object oj depending on the request probability pj.
We can compute the probability Pr{Ij ≤ L} that the next request comes within the limit L, the mean
number E(Ij | Ij ≤ L) of requests that such an object oj has to stay in the cache until a next hit on oj.
Both values finally determine the mean number E(n≤ L) of objects with next request before rT+L+1:

Figure 8. Screenshot from the original paper

Source: Hasslinger et al., 2018, p. 6.

If M < E(n≤L) then the cache is expected to fill up with objects having a valid next request time within the limit L, such that optimum caching is prevalently applied for updates. We calculate E(n≤L) for the four curves with different limits L in Figure 8 based on the underlying Zipf request probabilities pj: E(n≤1000) ≈ 12.7; E(n≤10 000) ≈ 740.3; E(n≤30 000) ≈ 4328; E(n≤100 000) ≈ 23 079. Those numbers precisely mark the cache sizes at which each limited look-ahead curve starts deviating from optimum caching.

The results in Figure 7 and Figure 8 indicate that the proposed combined caching strategy with limited look-ahead has significant effect for L ≥ 10 000, and in the trace evaluation already for L ≥ 1 000. The trace is taken from a cache serving about 100 requests per second. Thus a delay of 10 s - 100 s is sufficient to make an exploitation of the look-ahead beneficial. Concluding, the look-ahead can improve cache efficiency for video streaming, with delays of > 10 s until most of the data is uploaded, whereas for short transfer delays < 1 s the look-ahead will be reasonable only for caches serving huge request rates.

Box 4. Look-ahead and its formalizations

Source: Hasslinger et al., 2018, pp. 4-6.
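Item B of Box 4 can be read as a small decision rule: objects whose next request is known to fall inside the look-ahead window are handled by Belady's rule, and a secondary policy (LRU here) manages the rest of the cache. The sketch below is one possible interpretation of that description, not the authors' implementation; next_use is the precomputed next-request index from the Belady sketch above, L is the look-ahead limit, and the preference for evicting from the LRU-managed part first is an editorial assumption.

from collections import OrderedDict

def lookahead_hits(requests, next_use, cache_size, L):
    """Combined scheme: Belady inside the look-ahead window of L requests, LRU otherwise."""
    belady_part = {}            # obj -> known next-request index (falls within the window)
    lru_part = OrderedDict()    # objects whose next request is unknown, kept in LRU order
    hits = 0

    for t, obj in enumerate(requests):
        in_cache = obj in belady_part or obj in lru_part
        if in_cache:
            hits += 1
            belady_part.pop(obj, None)
            lru_part.pop(obj, None)

        known = next_use[t] <= t + L        # next request visible inside the look-ahead window?

        if not in_cache and len(belady_part) + len(lru_part) >= cache_size:
            # an eviction is needed before inserting the missed object
            if lru_part:                     # secondary strategy handles the "unknown" part
                lru_part.popitem(last=False)
            else:
                victim = max(belady_part, key=belady_part.get)   # farthest known next request
                if known and next_use[t] < belady_part[victim]:
                    del belady_part[victim]
                else:
                    continue                 # every cached object is needed sooner: skip insert

        if known:
            belady_part[obj] = next_use[t]   # managed by Belady's rule
        else:
            lru_part[obj] = True             # managed by LRU (most recently used position)
    return hits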

Remarks on the hit rate of Belady's algorithm appear in the conclusion of the paper. The highlight is the identification of the optimum method as suitable for caching video streams or files under very heavy request loads (as is the case of geographic information, addressed in a later section).

The comparison of usual non-predictive caching schemes with clairvoyant


optimum caching due to Belady’s algorithm shows large gaps in the hit rate not only in
extreme cases for uniform requests, but also over the complete range of Zipf request
pattern that is confirmed to be relevant for web caching in literature. The results in
Figure 3 - Figure 6 give an overview of the expected hit rate performance and
differences in comparison of optimum caching, LFU and LRU for the independent
reference model. Our trace-based study confirms that such Zipf distributed IRM results
meet request pattern on web platforms.
The evaluation of a combined caching method making use of limited look-ahead
scenarios enabled by delays in cache uploads show that optimum caching is useful
not only to provide an upper bound on cache hit rates, but can partly be exploited for
caching of video streams and for caches serving huge request workloads. (Hasslinger
et al., 2018, p. 6).

The work of Shukla and Abouzeid (2017) addresses the storage cost of caching, taking into account the time for which content remains stored; they name this problem retention aware caching. Cloud storage rental costs and flash memory damage are the main motivations of the research. The authors' contribution is kept in their own words.

Prior work on caching (whether proactive or reactive) does not explicitly take into
account the storage cost due to the duration of time for which a content is cached. This
new problem, which we call retention aware caching, is motivated by two recent
technological developments that are described in the paper: cloud storage rental costs
and flash memory damage. We consider a hierarchical network consisting of a server
connected to a number of cache-enabled nodes, located either at the edge of a network
(e.g. base stations) or in the core of a data center. There are two types of network costs:
storage cost at the caches and download cost from the server. We formulate the problem
of proactive retention aware data caching (PRAC), which minimizes the total cost
subject to the node capacity constraints. We first prove that PRAC is NP-Hard in
general and then analyze PRAC for two cases: (1) linear storage cost, (2) convex
storage cost. We show that PRAC admits efficient polynomial time algorithms when the
storage cost is linear in retention times and caches have a large capacity. Furthermore,
we derive bounds on the performance of PRAC for the case when the storage cost is a
practically motivated convex function. Numerical evaluations demonstrate that PRAC
outperforms other state-of-the-art caching policies for a wide range of parameters of
interest. (Shukla, A. and Abouzeid, A. A., 2017, p. 1).


By adopting proactive caching instead of the usual reactive approaches, caching performance is boosted, since caching decisions are made in advance. This is particularly important for caching data on mobile platforms or on platforms that make heavy use of cloud technologies.

In this paper, we investigate the problem of Proactive Retention Aware content Caching
(PRAC) in a hierarchical network where every content is stored in the cache for a specific duration,
called retention time henceforth. This model for having retention times is motivated by two very
different practical problems, first is the storage cost for renting disk space in cloud networks (e.g. data
centers) and second is the cost incurred due to retaining data in the underlying hardware memory.
These two factors contribute to the storage cost and play a critical role in a data-intensive environment
where reliability is a prime concern. Traditionally, cache network models have largely focused on
minimizing the content download delay while overlooking the storage attributes [emphasis added].

The first motivation behind considering storage cost can be explained as follows. Cloud
service providers (like Amazon AWS, Google, Microsoft Azure) charge users for the actual usage of
disk space as well as file downloads. Many cloud providers offer CDN as a service.
Caching/replicating a popular file in different data centers minimizes the download cost. This has
motivated the prior work [10] of studying caching in data centers. However, caching a file at multiple
data centers incurs a storage cost at each of these data centers. This cost, which typically depends on
the duration for which a content is stored, has not been considered by previous works.

The second motivation behind considering storage cost is explained as follows. Storing a file
involves writing it to an underlying memory device which could be costly depending on the way it is
written. NAND Flash – the most widely used memory hardware for caching applications – is
marketed with an endurance/lifetime specified in the form of the number of P/E cycles it can sustain
[11], [12]. Thus, writing a content to the memory for a specified retention duration incurs a physical
damage that reduces its lifetime. Right before retention expiration, the content is rewritten by the
virtue of a scrubbing algorithm at the expense of a P/E cycle at each write, thus causing substantial
flash damage with each write/re-write. (See [13], [14] to read more about P/E cycles, retention times,
flash failures and reliability.) Prior work [11] minimized the device damage for an isolated content-
centric cache where the authors found that the optimal retention times are proportional to content
popularity.

Box 5. Proactive caching

Source: Shukla & Abouzeid, 2017, pp. 1-2.

Box 6 summarizes the work developed by the authors.

In this work, we study proactive content caching to minimize the sum of download and
storage costs in a hierarchical multicast-enabled cache network as shown in Figure 1. Typically, the
servers in such a cache network relay data to the leaf node either by employing a unicast transmission,
such as in a wired network (e.g. data centers [1], [2]) or by multicasting data to all the leaf nodes at
once. The multicasting mode of transmission becomes a natural choice in wireless edge caches [15]–
[17]; it is also used in hybrid networks, such as data centers, where both wireless and wired links
coexist [18].

Figure 9. The network consists of a server housing a library of M files and a set of N nodes equipped with caches. A multicast transmission from the server is received by every node and user cloud, whereas the multicast from a node can locally serve the demands of the associated user cloud only.

Source: Shukla & Abouzeid, 2017, p. 1.

Earlier works in this area did not consider storage cost. Our work may seem related to [3]
which proposes proactive data caching to minimize the content-retrieval delay (download cost),
however, it does not consider multicast for serving files. Authors in [16] propose multicast-aware
proactive caching to minimize the download cost but they did not consider storage cost. Another
related work in [17] assumes that every content can be stored only for a fixed duration time frame
and there is a cost associated with reloading the cache in the beginning of a time frame. While [17]
may appear very similar to our work, there are notable differences. First, their assumption regarding
the lifetime of a content chunk being a time frame is too stringent; we generalize this by allowing
contents to be stored for any duration in the multiple of time slots during a time frame. Second, their
assumption of cache reload cost being a random variable is not in accordance with the recent
literature that suggests that a deterministic increasing linear/moderately-convex function appropriately
fits the storage cost [13].

Box 6. The proposed proactive caching

Source: Shukla & Abouzeid, 2017, pp. 1-2.

Box 7 presents the system model under discussion and the formulation of PRAC.


A. System Model
We consider a time-slotted operation where each time slot is of duration dT units. Starting from time t = 0, a time frame consists of T slots, t = 1, 2, ..., T. Files are requested in every slot whereas they are proactively fetched only at the beginning of every time frame. Let pmn denote the probability of requesting file m at node n in a given slot (assumed independent over slots) and let [T] denote the set of slots in the frame; then the model allows proactively caching files at t = 0 for any number of slots ≤ T. The length T of the frame is usually determined by the periodicity in the request patterns. Our model is motivated by [3] which exploits the periodicity in user demands to proactively cache the content ahead of time (typically when the server is not busy). Owing to the temporal periodicity in user demand fluctuations, it is reasonable to consider a time frame equal to a day [3].

The cache network consists of a server and N nodes (see Figure 1). The server contains M files, where for the ease of exposition every file is of unit size (as in [17], [20]). We denote the set of N caches by [N] and the set of M files by [M]. The caches have finite capacity; cache n ∈ [N] can contain at most Bn files.

We now state the key assumptions used in our model.

• Content demands: We assume that file request probabilities are known/can be predicted ahead of a time frame [4], [8], [9]. We assume that file m is requested at node n in each time slot independently with probability pmn.

• Delay intolerant service: We assume delay intolerant service, i.e. a file requested in a slot will be served in the same slot. This assumption is relevant for media files.

• Proactive caching: We assume that the nodes can proactively cache files in the beginning of each time frame for any number of slots ≤ T (and not during the frame, i.e. 1 ≤ t ≤ T). This assumption is motivated from [3] which advocates proactively caching files when the server is not busy (say, at midnight) owing to periodicity in demands.

• Multicast mode of transmission: If a file requested at slot t ∈ [T] is present in the cache then it is served instantaneously with no additional cost; otherwise, the node forwards the request to the server. A server multicasts all the files that are requested in a slot as the traffic is delay-intolerant. A multicast transmission of a file is received by all caches including the ones that have not requested the file (see Figure 1).

• Storage cost: Let ymn ∈ [T] denote the retention duration defined as the number of slots for which file m is stored at cache n starting from t = 0. Storing a file for duration y in cache incurs a storage cost g(y) ∈ R+, g(0) := 0. We assume g(.) to be an increasing linear/convex polynomial function in this work motivated by [11], [13].

• Download cost: As all the files are unit-sized, we assume that a unit download cost is incurred on multicasting every file over a time frame of T slots.

B. PRAC formulation
As the cache network timeline is divided in time-frames of duration T time slots and the file
request probabilities are assumed to be stationary, we formulate the objective only for the duration of
a time frame.

The objective function consists of download and storage costs. The download cost is
calculated as follows: In the tth slot of a time frame, the probability that the server receives at least
one request for file m from any of the caches becomes,

Figure 10. Screenshot from the original paper

Source: Shukla & Abouzeid, 2017, p. 3.

where the product is over all nodes that do not contain the requested file at time t. Here ymn < t implies that the file m is no longer present in the node n as it had been stored only for the beginning ymn time slots in a time frame. P(m, n, t) is the probability with which the server multicasts file m in the tth time slot. Thus P(m, n, t) summed over all the time slots in a frame also represents the total download cost since we assume that each multicast carries unit cost. Furthermore, the storage

cost during a time frame can be expressed as the summation of α·g(ymn) over the files and nodes, where α ∈ [0, 1] is the weight of the storage cost relative to the download cost. Then the objective of PRAC, assumed to be the average of the storage cost and the download cost over a time frame, becomes the following:

Figure 11. Screenshot from the original paper

Source: Shukla & Abouzeid, 2017, p. 3.

C. The hardness of PRAC


We prove that PRAC is NP-Hard even when the storage cost g(.) is linear in retention times.
Theorem 1. PRAC is NP-Hard.

Figure 12. Screenshot from the original paper

Source: Shukla & Abouzeid, 2017, p. 3.

Remark: The above result is not surprising since various problems on optimal caching
exhibit the property of being NP-hard [16], [22], [23] or even inapproximable [16].

Box 7. System model and the PRAC formulation

Source: Shukla & Abouzeid, 2017, pp. 2-3.
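A direct numerical reading of the objective in Box 7 can be written down as follows. This is an illustrative sketch under the excerpt's assumptions (unit-size files, unit multicast cost); the function name, the default linear g and the division by T to express a per-frame average are editorial choices, not the paper's code.

import numpy as np

def prac_cost(p, y, alpha, T, g=lambda d: d):
    """Evaluate the PRAC objective for given retention times.

    p[m, n] : probability that file m is requested at node n in a slot
    y[m, n] : retention time (in slots, 0..T) chosen for file m at node n
    alpha   : weight of the storage cost relative to the (unit) download cost
    g       : storage-cost function of the retention duration (linear by default)
    """
    M, N = p.shape
    download = 0.0
    for m in range(M):
        for t in range(1, T + 1):
            missing = y[m] < t                       # nodes where file m is no longer cached
            # probability that at least one such node requests file m in slot t,
            # which triggers one unit-cost multicast from the server
            download += 1.0 - np.prod(1.0 - p[m, missing])
    storage = alpha * sum(g(y[m, n]) for m in range(M) for n in range(N))
    return (download + storage) / T                  # average cost over the time frame

# illustrative use: 3 files, 2 nodes, frame of T = 4 slots, linear storage cost
p = np.array([[0.5, 0.4], [0.2, 0.1], [0.05, 0.3]])
y = np.array([[4, 4], [0, 0], [0, 4]])
print(prac_cost(p, y, alpha=0.1, T=4))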


Box 8 presents the authors' discussion of PRAC under linear storage cost: "we show that PRAC admits polynomial time solutions in this case when nodes do not have cache capacity constraints" (Shukla & Abouzeid, 2017, p. 3). It details the mathematical formalization of algorithms, theorems and proof sketches.

In this section we formulate PRAC under the assumption that the storage cost of file m at cache n is linear in retention time, i.e. g(ymn) = ymn.

Then the cost objective with linear g(.) can now be expressed as:

Figure 13. Screenshot from the original paper

Source: Shukla & Abouzeid, 2017, p. 4.

We investigate IPa in two steps. First, we remove the cache capacity constraint (3). Under
this setting, IPa decouples over files and is equivalent to minimizing objective function individually
for each file. We then prove a key structural property of optimal retentions. This property enables us
in reducing IPa to an Integer Linear Problem (ILP). We characterize that this ILP can be efficiently
solved using a simple threshold based rule for large caches. Finally, for the capacity-constrained
caches, we propose a heuristic algorithm to assign files to caches respecting their capacity constraints,
given the optimal file retentions for the corresponding uncapacitated case.

A. Optimal retentions for a large cache


In what follows we assume that all caches are large (or uncapacitated). Typically data centers
have huge storage units so from a practical perspective one can consider that the storage capacity is
unlimited [1], [2].

We remove constraint (3) from problem IPa, thus assuming that the caches may contain as many files where each file is stored for their retention time. Removing the capacity constraint simplifies the analysis as now the problem decouples over files and can be independently solved for each file m ∈ [M]. Under this setting, the objective function for a given file m ∈ [M] can be expressed (where we slightly abuse the notation by reusing pmn = pn, ymn = yn) as:


Figure 14. Screenshot from the original paper

Source: Shukla & Abouzeid, 2017, p. 4.

Proof sketch. We prove this by making use of exchange argument where we show that in
the optimal solution if yi > yj for some pair i < j then by exchanging yi and yj the overall cost
can be reduced leading to a contradiction. ...

Theorem 2 proves that the optimal retention times (for a particular file across the caches)
preserve the order on request probabilities, i.e. higher the request probability, more likely is the file
to be retained for a longer time in the cache. We emphasize that the order specified in Theorem
2 depends only on the request probabilities and is independent of other parameters such as α, T .
Moreover, this characterization also applies for convex polynomial storage cost considered in Section
IV.


Figure 15. Screenshot from the original paper

Source: Shukla & Abouzeid, 2017, p. 4.


Figure 16. Screenshot from the original paper

Source: Shukla & Abouzeid, 2017, p. 5.

Proof sketch. The above theorem is a consequence of linearity of the objective function IPc
and Theorem 2. …

Figure 17. Timeline of cache evolution for N = 3, M = 1

Source: Shukla & Abouzeid, 2017, p. 4.

The optimal solution of IPc for each file can be computed in O(N log N ) time from Theorem
3, where sorting the request probabilities take O(N log N ) time. As discussed in the beginning of
Section III-A, with linear storage cost and large caches, the formulation of PRAC decouples over files.
Thus, an optimal solution of PRAC, for all files, with linear storage cost and unbounded cache
capacities can be computed in O(MN log N ) time.

Recall that so far we have solved the problem of determining optimal retention times assuming
large cache capacities. Next, we propose a heuristic to compute retention times for the case with linear
storage cost and finite cache capacities.

B. Cache capacitated prefetching with optimal retentions


Since PRAC is NP-hard in general and even for linear storage cost with finite cache capacities
(See Section II-C), we cannot guarantee efficient solutions for such a case. We thus present a heuristic
in Algorithm 1, which we call Fill-Cache, which proceeds via iterations on the set of nodes that
violate the cache capacity constraints.


Figure 18. Screenshot from the original paper

Source: Shukla & Abouzeid, 2017, p. 5.

Each iteration of Fill-Cache can be intuitively explained as follows: If the set of nodes U violating cache capacity constraints is non-empty, then remove that particular file from the nodes in U (i.e. set its retention to 0 for all the nodes in U) such that the cost La(y) is the least after removing it (or, in other words, agrees the most with the overall cost La(y) of the prior state). Note that computing La(y) is equivalent to computing the sum of Lc(y) for each file.

Fill-Cache is polynomial time, O(MN), as each iteration requires computing the effect of
removing at most M files and there are at most N iterations.

Thus far we analyzed the problem of minimizing the sum of linear storage and download cost
for a caching application where caches can prefetch data for serving requests over the time frame.
Although the problem is NP-hard for linear storage cost, we proved that when caches have large
capacities, an optimal solution can be computed in polynomial time and characterized the structural
properties of resulting retention times. Further, we proposed a heuristic algorithm for the case of finite
capacity caches.

Box 8. PRAC with linear storage cost: algorithms, theorems and proof sketches

Source: Shukla & Abouzeid, 2017, pp. 4-5.
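Algorithm 1 itself is only available here as a screenshot, but the iteration described at the end of Box 8 can be sketched roughly as follows: while some node exceeds its capacity, drop from it the file whose removal increases the overall cost the least. All names and the cost callback below are hypothetical; this is an editorial reading of the description, not the authors' code.

def fill_cache(y, capacities, cost):
    """Capacity-repair heuristic sketched from the Fill-Cache description.

    y          : dict (file, node) -> retention time from the uncapacitated solution
    capacities : dict node -> maximum number of files the node may hold
    cost       : callable evaluating the overall objective for a retention map
    """
    def cached_files(node):
        return [m for (m, n), ret in y.items() if n == node and ret > 0]

    while True:
        violators = [n for n in capacities if len(cached_files(n)) > capacities[n]]
        if not violators:                          # every node respects its capacity
            return y
        node = violators[0]
        best_file, best_cost = None, float("inf")
        for m in cached_files(node):               # try dropping each cached file from this node
            trial = dict(y)
            trial[(m, node)] = 0                   # retention 0 means the file is removed here
            c = cost(trial)
            if c < best_cost:
                best_file, best_cost = m, c
        y[(best_file, node)] = 0                   # keep the removal that hurts the cost least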

Box 9 addresses the algorithms and theorems related to the analysis of the case in which the storage cost "is given by a convex, increasing polynomial function of retention times" (Shukla & Abouzeid, 2017, p. 5).


Figure 19. Screenshot from the original paper

Source: Shukla & Abouzeid, 2017, p. 5.


Figure 20. Screenshot from the original paper

Source: Shukla & Abouzeid, 2017, p. 6.


Figure 21. Screenshot from the original paper

Source: Shukla & Abouzeid, 2017, p. 6.

Observe that Theorem 4 gives an upper bound, γ, which is a function of only T and d.
Figure 3 shows that the bound is close to 2 for convex polynomials of degree 2; for convex
polynomials of small degree it increases approximately linearly. The values do not vary significantly
across α. Note that our bound is very good (a factor of 2 or 3) for practical cases when α is small and
the storage function is not very convex (has a degree at most d = 2 or 3) [13]. Further note that the
bound evaluated in simulations is close to 1 for all the cases presented in Figure 3 illustrating that our
rounding algorithm produces an integral solution with cost very close to the cost of


Figure 22. Screenshot from the original paper

Source: Shukla & Abouzeid, 2017, p. 6.

(see the proof of Theorem 4), which for well-spread values of y_ and large values of T tends to be
close to 1.

Figure 23. Screenshot from the original paper

Source: Shukla & Abouzeid, 2017, p. 6.

C. Cache capacitated prefetching with optimal retentions


Unlike the linear storage case, where the retention times were either 0 or T, with the convex storage cost function the optimal retention times may take values in the set {0, 1, ..., T}. However, we continue with (the same) Algorithm 1 as proposed in Section III-B for filling the caches respecting their capacity constraints. See further results from numerical evaluations in Section V-B.

Box 9. Algorithms and theorems for convex storage cost

Source: Shukla & Abouzeid, 2017, pp. 5-7.

The conclusions of the work are kept in the authors' own words.

We considered the problem of proactive retention aware caching (PRAC) motivated by


the applications where storage cost is critical to the performance of the cache
network. We formulated PRAC for a hierarchical network, with the objective to minimize
the total cost subject to the node(s) capacity constraint(s). We first proved that PRAC is
NP-Hard; and then analyzed it for both linear and convex storage costs. We showed
that PRAC admits efficient polynomial time algorithms for large caches when the
storage cost is linear in data retentions. When cost is a convex function, we derived
bounds on the solution, that proved that our solution is close to the optimal. Our
numerical evaluations demonstrated that PRAC outperforms caching policies that are
retention unaware for parameters of interest. (Shukla, A. and Abouzeid, A. A., 2017, p.
9).


3 Caching de Modelos Geográficos

O caching é utilizado para o armazenamento temporário de informações de forma a se ter acessos mais rápidos a elas, agilizando o processo de navegação do usuário.
A mesma característica de aplicações GIS que impacta a sua indexação, o fato de lidar com grandes volumes de dados, afeta o seu sistema de caching. Lidar com essa característica é, por isso, fundamental em ambientes mobile que incluam tecnologias de nuvem.

3.1 Desenvolvimentos em ambiente web com tecnologia em nuvem


Zhong, Fang, & Zhao (2013) propõem um esquema de caching para dados espaço-temporais denominado VegaCache. Sua principal contribuição é poder lidar com a exigência de acessos em tempo real típica de aplicações online geoespaciais. Os autores o descrevem como funcionando como uma “grande memória” (Zhong et al., 2013). Os autores também desenvolvem os modelos ORM (Object Replacement Model) e LCM (Localized Caching Model), que visam melhorar a eficiência diante dos padrões de acesso daquelas informações. O VegaCache é desenvolvido também em plataforma de nuvem.
Os autores discutem o VegaCache à luz dos atuais sistemas existentes para lidar com
cache de informações geográficas, ressaltando a lacuna que o proposto preenche.

The desiderata for the backend data management infrastructure used in geospatial
applications include high I/O throughput, dynamic scalability and efficient spatio-temporal data
access.

The Spatial Database Management Systems (SDBMS) are served as the state-of-the-art data
infrastructure for geospatial applications. The big spatio-temporal data and numerous concurrent
accesses have posed grand challenges on backend SDBMS due to its intolerant disk I/O latency.
Firstly, web-based geospatial applications are characterized as data-intensive and access-intensive.
The SDBMS cannot provide high I/O throughput due to the limited storage performance of the
underlying DBMS. Secondly, most of the existing methods are based on single-node SDBMS; as a
result, their access efficiency cannot be improved as the data and concurrent users increase, due to
poor scalability. Thirdly, most of the geospatial applications are read- and write-heavy access patterns,
and the I/O access latency becomes the performance bottleneck while processing geospatial
computation. The spatial indexes and data retrieval patterns of SDBMS are tuned for disk rather than
memory cache. Since the disk seek operation is costly, I/O overload is unavoidable in SDBMS while
there are numerous concurrent accesses for geospatial data.

Motivated by the above deficiencies, in this paper, we propose VegaCache, a novel


distributed in-memory spatio-temporal data caching scheme, to meet the real-time access
requirements of online geospatial applications. VegaCache takes advantage of elastic physical
resources of shared-nothing commodity cluster, and it serves as a “big memory” from applications’
perspective. Each node contributes a tunable memory pool, and VegaCache composes a larger
memory by combining these pools together via a network switch. VegaCache provides elastic
memory resource that varies with data volume by adding or removing cluster nodes, and its memory
capacity can be increased linearly with larger clusters. The active geospatial data are resident in
VegaCache for serving online applications, while the disk-based SDBMS is worked only for data
backup and recovery.


In addition, we implement VegaCache middleware on top of the cloud platform, and
Memcached is adopted as the server-side daemon into which the data is loaded. The data objects are
distributed and manipulated from client-side, and the ORM and LCM implementation details are
encapsulated into VegaCache runtime system and served clients as the client library. As memory
becomes cheaper and network capacity increases, VegaCache could be scaled out with larger cluster
and faster hardware. It serves requests by fetching data from the memory pool rather than disk pages,
and hence it could eliminate I/O latency while there are numerous concurrent accesses.

Quadro 10. VegaCache e os sistemas atuais


Fonte: Zhong et al., 2013, pp. 1-2.

O modelo de object replacement (ORM) determina quais dados devem ser carregados na memória cache e quais devem ser descartados dela; já o modelo de localized caching (LCM) lida com a redução da sobrecarga de transferência de dados (Zhong et al., 2013).

Moreover, we designed an Object Replacement Model (ORM) to determine


which data should be loaded into or swapped out from the distributed memory cache.
The access frequency of data object is counted periodically and the top-ranked objects
are marked as “hotspot” data with respective ratings. Once an object is determined as
hotspot data, its geographically adjacent objects are dynamically loaded into
VegaCache and then clients are requesting data from memory cache rather than disk,
which can greatly improve the access efficiency without disk I/O latency. Since the
memory capacity is much smaller than the data volume, the effective use of memory
resource is of significance. Thus, the low-rating data are marked as “victim” data and
will be swapped out of memory.
Besides, to reduce data transfer overhead, we design a Localized Caching Model
(LCM) with consideration of geographic proximity and geospatial data access patterns.
The adjacent geospatial objects are persistent in the sequential disk pages of cluster
node in terms of their spatio-temporal proximity. LCM guarantees that the data in the
local disk are firstly resident in the same memory pool maintained by VegaCache. If the
local memory is used up, the data will be transferred to a remote memory pool. The
spatio-temporal data objects are grouped together in terms of geographic location and
each object is assigned a unique identifier as the key. The data objects belonging to the
same region will be organized into the same cache pool. VegaCache is transparent to
geospatial applications, and it provides unified interfaces for clients. (Zhong et al.,
2013, pp. 1-2).
A Figura 24 mostra a arquitetura do sistema proposto com seus dois módulos constituintes, um gerenciador de dados e outro de cópia e recuperação. A descrição é mantida nas palavras dos autores.

Our scheme bridges the gap between access patterns of geospatial


applications and provisions of cloud computing cluster. It takes advantage of
storage and computing capabilities of cluster nodes. Likewise, VegaCache is
composed of memory pools and its capacity is the total size of all pools. The
online geospatial applications are served by VegaCache buffer manager, and
geospatial data are stored in memory represented as binary key-value objects
via distributed caching middleware. Furthermore, since memory is no longer
a scarce resource and its capacity can be extended with more nodes, all of the
online geospatial data could be resident in the big memory of VegaCache. To
guarantee the data integrity and fault-tolerance, we have designed spatio-
temporal data backup and recovery function by using the cloud storage system.
The disk-based SDBMS and NoSQL database (i.e., HBase) are adopted as the
long-term data preservation system for geospatial applications.
Since the geospatial access pattern is read many write once, the
frequently accessed data in VegaCache have three replicas in the backend
storage system. The new data are updated into memory pool with write-ahead-
logging (WAL) mode, in other words, the data is written into the memory pool
after logging process behaviors. Once the VegaCache cluster node has failure
problems such as network partition and power off, the updated data will be
recovered by redoing the operation logs, and then sends the latest data to the
nodes where the replicas are preserved. Finally, the newest data will be
reloaded into VegaCache for serving requests.
VegaCache has characteristics of being distributed, layered and
loosely coupled. ... VegaCache buffer manager serves the I/O-heavy workloads
that require real-time response and low latency, whereas the data backup and
recovery module is served for offline access and fault-tolerance. (Zhong et al.,
2013, p. 2).


Figura 24. System Architecture of VegaCache


Fonte: Zhong et al., 2013, p. 3.

Os autores explicam o mecanismo que envolve a plataforma de nuvem e garante a resposta em tempo real do sistema proposto.

As shown in Fig.1, the data backup and recovery subsystem is built on top of
cloud data management infrastructure, which contains NoSQL database and SDBMS.
We use HBase, a leading NoSQL database based on the Hadoop cluster, to preserve the
big spatio-temporal data and semi-structured data such as imagery data and textural
Point of Interest (POI) data. Since HBase only has support for key-based query
predicate, it cannot process spatio-temporal queries, which limits its capability. To meet
the requirements of geospatial applications, we have designed a bidirectional
transmission tunnel to exchange data between HBase and SDBMS. The spatio-temporal
query requests are redirected to SDBMS for further spatial computation and the query
processing results are transferred to the VegaCache buffer. The Memcached server is
widely used in web applications. Each VegaCache node manages one or more
Memcached server and a Memcached daemon maintains one memory pool, i.e.,
VegaCache is composed of many memory pools, and the ORM and LCM models are
integrated into the buffer manager module. From the applications’ perspective,
VegaCache is working as the distributed caching middleware that manages the memory
pools and bridges the gap between Memcached and spatio-temporal access patterns.
(p. 3)
The geospatial application clients interact with VegaCache middleware via
OGC-compatible spatial access interfaces. VegaCache provides a unified global view
for geospatial applications, and the details of data distribution, request scheduling,
remote memory management and storage organization that are transparent to
applications. The clients can utilize the big distributed cache just as they call the local
memory cache. With the data residing in the memory, applications do not need to
access disk, and the access latency can be greatly reduced without disk I/O. Hence our
scheme could guarantee real-time response to meet the requirements of online
geospatial application services. (Zhong et al., 2013, p. 3).

Como o VegaCache é uma solução baseada em caching de objetos, é necessário o desenvolvimento de um modelo que lide com esses objetos para aplicações geoespaciais. Os autores discutem esse modelo, com o Quadro 11 apresentando as notações que
serão utilizadas para o respectivo desenvolvimento. As formulações matemáticas também são
apresentadas e descritas nas palavras dos autores.

… To make full use of physical memory resource, data objects that will not be accessed in
recently should be swapped out of VegaCache. We present a data replacement model specifically for
geospatial application, and it takes characteristics of spatio-temporal data and geospatial access
patterns into account. The traditional Memcached caching methods are designed for web applications
with simple data models such as textual and structured records, whereas the spatio-temporal data are
semi-structured, high dimensional and complex, our scheme should bridge the gap between key-value
object caching model and geospatial applications.

Since the Memcached server provides simple key-value data models, the spatial and temporal
information is encrypted into the KEY by computing the Hilbert curve value of data objects, while
the VALUE object is the real data content of raster image data and vector-based feature represented
by Well Known Binary (WKB) and Well Known Text (WKT) format. Table I shows a reference of
the basic notations used in this paper, and they will be explained at their first appearance.

Notation – Definition
— – The total capacity of distributed memory cache pools in VegaCache.
N – Number of cluster nodes.
— – Total number of spatio-temporal objects in the distributed cache.
P – The number of memory pools of cluster.
i – The node indicator, i-th node (1 ≤ i ≤ N).
j – The j-th memory pool on cluster.
Rr – The geographic area maintained by the r-th memory pool (1 ≤ r ≤ j).
k – The k-th spatio-temporal object (1 ≤ k ≤ total number of objects).
mj – The size of memory pool on j-th node.
ni – The number of memory pools on i-th node.
sk – The size of data object (min ≤ sk ≤ max).
d(x, y, t) – The value of d-order Hilbert curve; an object is indicated by two-dimensional spatial location (x, y) and timestamp (t).
B – The network bandwidth of cluster, which is a constant determined by communication switch.
— – A constant of threshold size of sub region data.

Figura 25. Table I. Basic notations


Fonte: Zhong et al., 2013, p. 3.

The access behavior of clients will be traced and recorded into logging files. VegaCache
daemon threads will analyze the log and periodically count the access frequency of data objects, and
hence determine the hotspot geographic region and sort the ratings of accessed data objects
dynamically. The VegaCache capacity is depended on the number of cluster nodes and number of
respective memory pools on each node. It is formulated as (1):

Figura 26. captura de tela


Fonte: Zhong et al., 2013, p. 4.
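Com base na descrição textual de que a capacidade do VegaCache é a soma dos tamanhos de todos os pools de memória dos nós do cluster, uma reconstrução plausível da equação (1) é a seguinte (o símbolo M para a capacidade total e a notação são apenas ilustrativos):

\[
M = \sum_{i=1}^{N} \sum_{j=1}^{n_i} m_j
\]

em que N é o número de nós do cluster, n_i o número de pools do i-ésimo nó e m_j o tamanho de cada pool, conforme a Table I.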

The memory cache capacity of VegaCache is the total size of all memory pools on cluster
nodes. A memory pool maintains one geographic region. In our proposal, the whole map layer is
partitioned by quadripartition strategy, i.e., a region is divided into four sub regions until their size
reached to a given threshold. The threshold is an empirical value, which is determined by data volume
size and memory pools of cluster. The spatio-temporal objects are sorted with a space filling curve, and the Hilbert curve is chosen for its preservation of spatial proximity. The Hilbert value of objects is used as the KEY in the VegaCache memory pool, and it is computed by d(x, y, t), where (x, y, t) represents the spatial location and timestamp. The d represents the order value of the Hilbert function, and its numeric range is formulated as (2):

Figura 27. captura de tela


Fonte: Zhong et al., 2011, p. 4.

In practical applications, the d value is computed while the data are imported into cloud storage systems. The spatio-temporal data are stored as <KEY, VALUE> objects in the data blocks. Typically, the block size is determined by a trade-off between geographical area (Rr) and memory size; a larger block is beneficial to a disk-based distributed file system, whereas VegaCache favors smaller blocks.
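Como o texto descreve o uso do valor da curva de Hilbert como KEY dos objetos, segue um esboço mínimo, em Python, do cálculo desse índice em duas dimensões. O componente temporal t é ignorado por simplicidade e a discretização prévia das coordenadas é uma suposição; a forma exata da função d(x, y, t) usada pelos autores não é reproduzida aqui.

```python
# Esboço mínimo do cálculo da chave (KEY) via curva de Hilbert em 2D.
# Algoritmo clássico xy2d para uma curva de ordem d (grade de lado 2**d).

def hilbert_key(x, y, d):
    """Converte coordenadas discretizadas (x, y) no índice da curva de Hilbert de ordem d."""
    n = 1 << d                       # lado da grade: 2**d
    key = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        key += s * s * ((3 * rx) ^ ry)
        # rotaciona/reflete o quadrante para preservar a adjacência espacial
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return key

# Células vizinhas tendem a receber chaves próximas, o que favorece
# agrupar objetos geograficamente adjacentes no mesmo pool de memória.
print(hilbert_key(3, 4, d=3), hilbert_key(3, 5, d=3))
```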


According to Tobler’s first law of geography, the clients are vulnerable to access the
geographically adjacent data objects and different clients may access different areas. If more sub
regions are loaded into memory cache, more concurrent users will be served by VegaCache, and
hence it could improve the whole access efficiency, with most of the accessed data pre-fetched into the distributed cache.

Let W denote the whole map layer, and its geographic region is represented by its Minimum Bounding Rectangle (MBR), i.e., MBR(W). The spatio-temporal data are partitioned into smaller data fragments, and their size is close to the threshold. However, the geographic
area is not split evenly, and the split process is recursively executed until all sub regions are satisfied
the threshold. The data objects are evenly distributed across VegaCache memory pools in terms of
their geographic regions, and each memory pool maintains objects belonging to a region. The
dynamical spatio-temporal data replacement algorithm is described as Algorithm 1:

Algorithm 1 Spatio-temporal data replacement procedure


1: Compute the access frequency ratings of objects and return the Hash Map list represented by HM(objID, ratings).
2: Choose objects whose ratings are ranked at top from access log list.
3: Determine the hotspot regions in terms of the total access frequency of their containing objects. The returned regions are denoted by Rh, where the h


Figura 28. captura de tela


Fonte: Zhong et al., 2013, p. 4.

7: Reevaluate the access frequency ratings periodically; the in-memory objects will be swapped out when their ratings decrease to the last 1/N, then select the objects ranked at top N and load them into distributed memory pools of VegaCache, and return to Step 3 for further processing.
8: VegaCache dynamically replaces the caching objects in it and maintains high availability to
serve heavy concurrent request workloads. The VegaCache daemons are still on running state to
guarantee real-time response.

Quadro 11. Modelo para lidar com objetos para aplicações geoespaciais
Fonte: Zhong et al., 2013, pp. 3-4.
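Segue um esboço em Python inspirado no procedimento de substituição do Algoritmo 1 transcrito acima. É apenas uma aproximação ilustrativa: o log de acessos, a função de vizinhança geográfica e a capacidade são estruturas hipotéticas, e os passos intermediários (visíveis somente na captura da Figura 28) são resumidos.

```python
# Esboço simplificado de um ciclo de substituição dinâmica de objetos
# espaço-temporais: contagem periódica de frequência de acesso, carga dos
# objetos "hotspot" e de seus vizinhos, e descarte dos de menor rating.

from collections import Counter

def ciclo_de_substituicao(log_acessos, cache, vizinhos, capacidade, top_n):
    """Retorna o novo conteúdo do cache (conjunto de objIDs) após um ciclo."""
    ratings = Counter(log_acessos)                              # passo 1: HM(objID, ratings)
    quentes = [obj for obj, _ in ratings.most_common(top_n)]    # passo 2: top-ranked

    novo_cache = set(cache)
    for obj in quentes:                 # passos seguintes (aprox.): hotspot + adjacentes
        novo_cache.add(obj)
        novo_cache.update(vizinhos.get(obj, []))

    # passo 7 (aprox.): descarta os objetos de menor rating ("victim") se exceder a capacidade
    if len(novo_cache) > capacidade:
        ordenados = sorted(novo_cache, key=lambda o: ratings.get(o, 0))
        excedente = len(novo_cache) - capacidade
        for vitima in ordenados[:excedente]:
            novo_cache.discard(vitima)
    return novo_cache

# Exemplo de uso com dados fictícios
log = ["t1", "t1", "t2", "t3", "t1", "t2"]
viz = {"t1": ["t1a", "t1b"], "t2": ["t2a"]}
print(ciclo_de_substituicao(log, cache=set(), vizinhos=viz, capacidade=4, top_n=2))
```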


Os autores também desenvolvem um modelo de caching localizado para lidar com a transferência de dados e dar efetividade ao VegaCache. O Quadro 12 esquematiza o modelo; sua formulação matemática e as considerações do desenvolvimento são apresentadas nas palavras dos autores.

The access latency is only dependent on the network bandwidth if the data are resident in
VegaCache caching pools. Thus, to reduce the latency, the effective way is to avoid data transfer via
network. We propose a localized caching model for VegaCache to balance the tradeoff among access
efficiency, data transmission and cache hit rate. Above all, since the access speed of memory is in
general orders of magnitude faster than that of disk, the system could provide lower latency access
for concurrent requests if there are more data objects are resident in the memory cache. However, the
memory pool capacity on a single node is very limited, and the data transfer via network is a time-
consuming process because the network latency is even more than disk I/O latency, which may
counteract the faster access benefits from the memory cache. Hence, VegaCache should transfer data
objects to memory pools of remote nodes as less as possible. We tuned the localized caching models
from two facets, which include writing data objects into distributed cache and reading data from
memory pools.

As shown in Fig.2, the data objects are firstly written into the local memory pool. If all local
memory pools are full, the data will be sent to the remote memory cache by executing data replacement
procedure as in Algo.1. Typically, a cluster node has ni ( i indicates node ID) memory pools, and each
of them maintains data within a geographically adjacent region. Thus, the concurrent clients who
access the adjacent objects could be served by memory pools on the same node, which reduces access
latency for it needs not to transfer data from remote nodes over the network. To cope with a single
point of failure problem caused by too many clients requesting data from one node, VegaCache
distributes the requested data in terms of geographical proximity to avoid single-node overload.

Furthermore, if the remote memory pools are not enough, the data items should be swapped
out from local memory pool with Least Recently Used (LRU) algorithm, and the latest data objects
will be written into local memory pools.

Figura 29. The workflow diagram of writing data into VegaCache


Fonte: Zhong et al., 2013, p. 5.


As shown in Fig.3, VegaCache improves the read efficiency and guarantees real-time response by retrieving data from the memory cache. It firstly determines the node location and memory pool by computing the hash function. If the data objects are in any memory pool, then it returns the requested data to clients. Otherwise, if a cache miss occurs, VegaCache will load data objects from the underlying disk-based storage system and pre-fetch the adjacent data into the distributed cache buffer for later access, finally returning the data objects to clients.

Figura 30. The workflow diagram of reading data from VegaCache


Fonte: Zhong et al., 2013, p. 5.

Figura X. captura de tela


Fonte: Zhong et al., 2013, p. 5.

Figura 31. captura de tela


Fonte: Zhong et al., 2013, p. 5.

Quadro 12. Modelo de caching do VegaCache


Fonte: Zhong et al., 2013, pp. 4-5.
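Os fluxos de escrita e leitura descritos no Quadro 12 (escrever primeiro no pool local, transbordar para o pool remoto e, em último caso, desalojar via LRU; na leitura, localizar o pool e, em caso de falta, carregar do disco e pré-buscar os vizinhos) podem ser resumidos no esboço a seguir. Os pools como OrderedDict, a função carregar_do_disco e a lista de vizinhos são suposições ilustrativas, não a implementação do VegaCache.

```python
# Esboço mínimo do fluxo de escrita/leitura do modelo de caching localizado.

from collections import OrderedDict

class PoolLRU(OrderedDict):
    def __init__(self, capacidade):
        super().__init__()
        self.capacidade = capacidade
    def cheio(self):
        return len(self) >= self.capacidade

def escrever(chave, valor, pool_local, pool_remoto):
    """Escreve primeiro no pool local; se cheio, tenta o remoto; senão, desaloja via LRU local."""
    if not pool_local.cheio():
        pool_local[chave] = valor
    elif not pool_remoto.cheio():
        pool_remoto[chave] = valor
    else:
        pool_local.popitem(last=False)      # remove o menos recentemente usado (LRU)
        pool_local[chave] = valor

def ler(chave, pool_local, pool_remoto, carregar_do_disco, vizinhos=()):
    """Retorna do cache se houver acerto; em caso de falta, carrega do disco e pré-busca vizinhos."""
    for pool in (pool_local, pool_remoto):
        if chave in pool:
            pool.move_to_end(chave)         # atualiza a recência
            return pool[chave]
    valor = carregar_do_disco(chave)        # falta de cache: vai ao armazenamento em disco
    escrever(chave, valor, pool_local, pool_remoto)
    for viz in vizinhos:                    # pré-busca dos objetos geograficamente adjacentes
        if viz not in pool_local and viz not in pool_remoto:
            escrever(viz, carregar_do_disco(viz), pool_local, pool_remoto)
    return valor

local, remoto = PoolLRU(2), PoolLRU(2)
print(ler("tile_1", local, remoto, carregar_do_disco=lambda k: f"dados({k})", vizinhos=["tile_2"]))
```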

O Quadro 13 apresenta as informações da avaliação de desempenho do VegaCache em termos de taxa de acerto do cache, vazão de acesso, tempo de resposta e eficiência na recuperação de dados.

Experiment Environment
The experiments are conducted on an 8-node cluster, and each of them has two quad-core Intel CPU
2.13GHZ, 4GB DDR3 RAM, 15000r/min SAS 300GB hard disk. All nodes are deployed with CentOS
5.5, Memcached-1.4.15, SDBMS (i.e., PostGIS-2.0), Hadoop-1.1, HBase-0.94 and LoadRunner. The
spatio-temporal dataset is about 450.7 GB, which contains real raster data and vector data.

Cache Hit Ratio of VegaCache


We evaluated the cache hit ratio of VegaCache with different memory capacity, and each node is
allocated 512MB, 1GB, 1.5GB, 2GB, 2.5GB and 3GB memory, i.e., the memory capacity of VegaCache
ranges from 4GB to 24GB. The benchmark sends data requests randomly across whole geographic
region. As shown in Table II, the cache hit ratio of VegaCache could be improved with larger memory
capacity, and it increases from 52.87% to 98.53% when the distributed cache size is enlarged from 4GB
to 24GB. Additionally, with the data volume becoming larger, our scheme could improve the cache hit
ratio by adding more physical memory and cluster nodes, which guarantees most of data objects resident
in the distributed cache and hence reduce disk I/O latency.

Table II. Cache hit ratio of VegaCache

Memory Capacity of VegaCache (GB): 4GB / 8GB / 12GB / 16GB / 20GB / 24GB
Cache Hit Ratio (%): 52.87 / 68.92 / 79.21 / 87.33 / 93.06 / 98.53

Access Throughput Performance


To confirm the effectiveness of supporting concurrent users, we evaluate the access throughput in terms
of Transactions per Second (TPS) measurements and use Loadrunner to emulate two hundred users
requesting for geospatial data service. The access patterns are determined by read and write ratio, which
ranges from 10%-100%. As shown in Fig.4, the throughput performance of SDBMS with the local cache
is about 1.58-2.37 times better than that of SDBMS without cache, and the TPS of VegaCache is 12.78-
17.39 times more than that of SDBMS with local cache. Moreover, the throughput of VegaCache
increases more progressively with larger memory capacity. Its average performance is improved by
1.78-2.98 times better when the memory capacity is enlarged from 4GB to 24GB.


Figura 32. Access throughput performance


Fonte: Zhong et al., 2013, p. 6.

Response Performance
We evaluated the response performance with different numbers of concurrent users, and VegaCache is
compared to SDBMS with and without cache during experiments. With real geospatial application
workloads, the average response time is logged under different circumstances.

Figura 33. Average response performance


Fonte: Zhong et al., 2013, p. 6.

As shown in Fig.5, the average response time becomes longer as the number of concurrent users
increases. The response performance of SDBMS is very low whether there is a cache or not, and the
average response time of SDBMS with cache is 23.8%-137.2% less than that of without cache, which
is about 400-6389 milliseconds. On the contrary, the average response time of VegaCache is kept stable
at about 68-580ms, and the response performance is even better with a larger memory, which is
improved by 1.5 orders of magnitude than SDBMS. Thus, VegaCache could respond to a large number
of concurrent clients in milliseconds, which provides real-time access efficiency for online geospatial
applications.

Data Object Retrieval Efficiency


We evaluated the data retrieval performance by retrieving ten groups of data within different geographic regions. The regions are equal to 5k (k = 1, 2, 3, …, 10) percent of the MBR of the map layer, which are denoted as Rk (k = 1, 2, …, 10). As shown in Fig.6, VegaCache outperforms SDBMS by about 4.47-9.76 times
in all test cases. Moreover, since the cache hit ratio will be improved with larger memory pool, the
average data retrieval performance of VegaCache is improved by about 13.6%-66.9% with the cache

size increased by 4GB, e.g., VegaCache has achieved the retrieval efficiency by 2.35-3.71 times better
when its capacity increases from 12GB to 24GB. VegaCache performs at a relatively stable level no
matter what size of the requested regions. In contrast, the retrieval performance of SDBMS is reduced
heavily while accessing larger region data. Additionally, the experiment results show that VegaCache
has good scalability, and its retrieval performance can be improved by adding more cluster nodes and
physical memory resources.

Figura 34. Data Retrieval Efficiency


Fonte: Zhong et al., 2013, p. 6.

Quadro 13. Avaliações de desempenho


Fonte: Zhong et al., 2013, pp. 5–6.

Sobre as avaliações de desempenho, os autores ressaltam algumas características do VegaCache cruciais para lidar com aplicações GIS.

… With the help of cluster technology, VegaCache can easily handle millions
of requests per minute and respond to clients in less than ten milliseconds,
which is imperative for reducing access latency in online geospatial
applications that require intensive I/O operations. Moreover, the cache hit rate
will be improved with more physical memory resources and a larger cluster.
The experimental results show that the geospatial processing efficiency could
be improved by several orders of magnitude and the access efficiency is
increased by more than one hundred times as well. (Zhong et al., 2013, p. 2).

Ao concluírem o trabalho, os autores destacam a possibilidade de aumento da capacidade do sistema e reforçam as três técnicas de que ele dispõe, que o tornam um esquema com boa escalabilidade, disponibilidade, tempo de resposta apropriado e eficiência.

It consists of three key techniques to guarantee real- time access, i.e., scalable
architecture of caching systems, data replacement model and localized caching
models. The capacity of VegaCache could be enlarged by simply adding more
cluster nodes and physical memory resources. Moreover, VegaCache could
improve its access efficiency when all of the online data are resident in the
memory pool and minimize data transmission across the cluster. … The results
show that VegaCache has good scalability, high availability, real-time
response and low-latency access efficiency. (Zhong et al., 2013, pp. 6-7).

Também considerando a plataforma em nuvem, Li, Wang, & Shi (2014) apresentam uma estratégia de cache com desenvolvimentos específicos para lidar com a intensidade dos dados espaço-temporais de aplicações GIS. A estratégia de cache replacement (substituição de cache) trata de tiles (blocos de dados de imagem [wikipedia]). A explicação da proposição dos autores é trazida em sua forma literal.

… the key issue to cache replacement for tile is to study a combination method which
can consider and balance both temporal and spatial locality features in tile access, not
only adapting to changes in access distribution of hotspot, but also reducing the
frequency of cache replacement. [O trabalho] presents a distributed cache replacement
method based on tile sequence with spatiotemporal feature in access pattern. The
method builds a LRU stack to storage current hot tiles and their popularities based on
the sequential feature when tiles are accessed, then structures tile sequence which
embodies both temporal and spatial locality in hot-tile access; moreover, chooses a
right tile sequence to be replaced. (Li et al., 2014, p. 134).

Ao explicar o método para desenvolver sua estratégia de caching, os autores mostram o comportamento do tile na navegação web, o relacionamento dos tiles com as informações de mapas e também com as capacidades de memória.

The advantage of a pyramid model for tile is to reduce the access times to hard
disk, improving access efficiency (Zhang, 2004). Let a tile with coordinates (ℓ, (tx, ty)),
where tx, ty represent the coordinates of the tile block as a unit and ℓ is the layer number
in the tile pyramid model. Client calculates the coordinates of the center tile which is in
the current browsing view by the location of the center and its longitude and latitude,
then request tile by coordinates (ℓ, (tx,ty)) to server. When user navigating, multiple tiles
are included in the current browsing view for a client, called as browser window.
However, although geospatial tile navigation and Web navigation have certain
similarities and the operation of browsing a map can be considered as similar to
following a Web page hyperlink, the user can simultaneously browse multiple tiles

rather than browsing one single Web page. Thus, tiles which have neighboring access
time are sequential in the spatial access pattern. For example, when the size of the
current browsing view is a*b and user requests tile (ℓ, (tx,ty)), the server will send a tile
sequence whose length is a*b to the client to display. However, tiles in the sequence
have spatial location correlation and their accesses time are also sequential.
Cache uses the static random-access memory (SRAM) for caching data, which is
distributed in a matrix. When CPU is calling data from cache, row decoder fixes the row
address firstly, then column decoder fixes the column address. Thus it fixes a unique
memory address of requested cached data, and transfers it to data bus by RAM interface.
Generally, SRAM reads data efficiently by a burst mode, which locks a memory row
when reading data, then quickly swept out all memory in different columns, reading all
bits of data on the column at the same time. Thus, the burst mode can improve the
efficiency for reading memory. Since a tile sequence has the spatiotemporal locality
when they are accessed, if store tiles in a sequence into consecutive memory, the tiles
which have the neighboring accessed time (such as tiles in a browsing view) have the
neighboring memory location in cache. It can use the reading characteristic of cache in
the burst mode to minimize the response time when user is roaming in NGIS. (Li et al.,
2014, p. 134).

O desenvolvimento da estratégia parte do algoritmo Optimal Page Replacement (OPT) e da sua aproximação prática, o LRU, sobre cuja pilha serão organizados os tiles relacionados a informações geográficas.

These data which will not be accessed permanently or for a longest time should
be replaced in Optimal Page Replacement Algorithm (OPT) (Sun, 2004). LRU
algorithm is closest to OPT in algorithm efficiency. It or its varieties are used widely,
such as in Google, World Wind, Microsoft Virtual Earth. LRU usually structures a stack
to input the accessed data from top to bottom. In this way, the top data always are most
recently accessed, and the bottom data always are the least recently accessed.
This paper structures LRU stack to embody the spatiotemporal locality for tile
sequence for cache replacement strategy. There are two key issues to be considered to
structure a LRU stack. One is how to use LRU stack to express the spatiotemporal
locality and spatial relationship between tiles. The other issue is how to use LRU stack
to balance the temporal and spatial locality in tile access pattern to get a better
replacement decision-making.
We divide the LRU stack into three portions to implement the tile sequence
structure and replacement operation. The first portion is tile receiving-pool. It receives
the latest accessed tiles and filters the tiles with higher access probability, preventing
them into the stack bottom. The second portion is tile sequence-pool, collecting all
accessed tiles which will be structured into sequences and be replaced in the near future.
The space size of the pool is floating from 0 to BUFFER_MAX. When the number of
tiles in the sequence-pool is BUFFER_MAX, it triggers the process of structuring tiles
into different tile sequences which have different lengths. The third portion is
replacement-queue-pool, to store and sort tile sequences which will be replaced. (Li et

al., 2014, pp. 134-135).

A contribuição dos autores é incorporar ao processo a substituição baseada nos pools de tiles; as condições em que isso acontece são mantidas nas palavras dos autores.

For a strict LRU stack, when a new request to tile B arriving, it should place the tile at
the top of LRU stack. It brings overhead cost and disadvantage to the manager for tile
sequences in our strategy. A tile, which is hit in a tile sequence, should be moved to the
stack top and resort all tile sequences. It will increase difficulty for tile sequence
generation and sort in the sequence-pool and replacement-queue-pool. This paper
proposes a new method to solve the problem. When a tile in the stack is hit, its access-
flag is marked as NEW, instead of moving to the stack top; when a tile isn’t hit, the
method will check the tile on the bottom of receiving-pool. If the tile’s access-flag is
NEW, then it is moved to stack top and its access-flag is marked as OLD; if its access-
flag is OLD, then the tile is moved to sequence-pool, to empty a space for tile B in the
receiving-pool. When the size of sequence-pool is BUFFER-MAX, then begin
structuring the tiles in the receiving-pool into different sequences, move these sequences
to replacement-pool and sort them. When the cached tiles exceed the replacement
threshold value, executes the cache replacement algorithm to replace the tile sequence
at the bottom of replacement-pool. Thus, the method can use LRU stack well while
reducing movements of tile. (Li et al., 2014, p. 135).
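O mecanismo de marcação NEW/OLD na receiving-pool descrito acima pode ser ilustrado com o esboço a seguir, em Python. Trata-se de uma simplificação: os limites RECEIVING_MAX e BUFFER_MAX são fictícios, e a geração das sequências de tiles e a replacement-queue-pool são omitidas para manter o exemplo curto.

```python
# Esboço ilustrativo do acesso a tiles com marcadores NEW/OLD na receiving-pool.

from collections import deque

RECEIVING_MAX = 3       # capacidade fictícia da receiving-pool
BUFFER_MAX = 4          # limiar fictício para disparar a geração de sequências

receiving_pool = deque()        # topo = índice 0; fundo = último elemento
flags = {}                      # access-flag de cada tile: "NEW" ou "OLD"
sequence_pool = []              # tiles aguardando serem estruturados em sequências

def acessar_tile(tile):
    """Trata um pedido de tile: acerto apenas marca NEW; falta examina o fundo da pilha."""
    if tile in receiving_pool or tile in sequence_pool:
        flags[tile] = "NEW"                     # acerto: marca, sem mover ao topo
        return "hit"
    # falta: se a receiving-pool está cheia, examina o tile do fundo para liberar espaço
    while len(receiving_pool) >= RECEIVING_MAX:
        fundo = receiving_pool[-1]
        if flags.get(fundo) == "NEW":
            receiving_pool.pop()
            receiving_pool.appendleft(fundo)    # "segunda chance": volta ao topo como OLD
            flags[fundo] = "OLD"
        else:
            sequence_pool.append(receiving_pool.pop())   # OLD: desce para a sequence-pool
    receiving_pool.appendleft(tile)
    flags[tile] = "OLD"
    if len(sequence_pool) >= BUFFER_MAX:
        pass    # aqui seriam geradas as sequências de tiles e enviadas à replacement-queue-pool
    return "miss"

for t in ["a", "b", "a", "c", "d", "e"]:
    print(t, acessar_tile(t))
```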

A eficiência do cache depende de como as sequências de tiles são geradas e de qual sequência é escolhida para substituição. A estratégia dos autores para o algoritmo de substituição é mantida em suas palavras.

The advantage of tile sequence is to improve the reading efficiency for cache. It
tries to put the spatiotemporal-related tiles into neighboring cache, to read data by only
one CPU instruction and respond the user’s request more quickly. Thus, the generation
of tile sequence and the sequence replacement method are all based on this idea. It tries
to replace the tiles which have neighboring cached location meanwhile, to get
successive caching space for subsequent hot accessed tiles. Actually, the replaced tiles
also have an access-spatiotemporal-locality. However, how to structure tile sequence
which has above characters and how to sort tile sequences in the replacement queue by
their storage values, it is the key issue of our proposed replacement method.
The accessed hotspot is changed and users have different operations when
having different roaming aims at different time. Thus the lengths of different tile
sequences with access-spatiotemporal-locality are different. This paper proposes a


method to generate tile sequence using the length of a tile sequence as a weight to
measure caching cost. It is a cache replacement algorithm based on tile sequence and
its size (SS cache replacement algorithm). [grifo nosso]
When it is triggered to generate multi-tile-sequence which has different length,
SS begins traversing each tile in the sequence-pool and sorts the tiles which have the
same row in cache into a sequence. Thus, the caching locations of tiles which are in the
same sequence have the same row and neighboring locations. The SS also assigns an
H-value for each sequence, which is inversely proportional to its caching cost.
According report (Young, 1994), data which has a lower H-value (that is has a higher
caching cost) will be replaced firstly, and it is considered as an optimal online caching
replacement method (Jin, 2000). Thus, the longer tile sequence will be replaced out
cache firstly [grifo nosso]. Meanwhile, avoiding keeping the shorter sequence which
hasn’t been accessed for a long time in the cache, the method will add the value of
caching cost by a time weighting factor T, to replace the shorter sequence at a proper
time. We use the H-value of the latest replaced tile sequence to mark T-value. When a
new sequence Seq is added to replacement queue, its H-value is set as Equation1.
H(Seq) = T + 1/size(Seq) (1)
size(Seq) is a function to get the number of tiles in tile sequence Seq, that is to
get the size of Seq. The tile sequences which are generated at the same time in the
sequence-pool have the same T-value to calculate their H-values, that is they have the
same sequence-generation-time. According to the H-value, new generated sequences
are sorted with the sequences already in replacement-queue-pool. The sequence with
lower H-value moves to the stack bottom, to be replaced firstly. Thus the latest accessed
longer sequence will have a higher H-value by the continuous expanded T-value, and
avoid to be replaced before the shorter sequence which have a lower H-value and
haven’t been accessed for a long time. Thus, function H(Seq) balances the temporal
locality and spatial locality for sequences in the replacement queue. (Li et al., 2014, pp.
135-136).
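A ordenação da fila de substituição pela Equação (1), H(Seq) = T + 1/size(Seq), pode ser esboçada como abaixo. A representação das sequências como listas de tiles e o uso de um min-heap são escolhas ilustrativas, e não a implementação dos autores.

```python
# Esboço mínimo da fila de substituição ordenada por H-value (método SS).

import heapq

class FilaDeSubstituicao:
    def __init__(self):
        self.heap = []          # min-heap: a sequência de menor H-value é substituída primeiro
        self.T = 0.0            # H-value da última sequência substituída (fator temporal)

    def adicionar(self, seq):
        h = self.T + 1.0 / len(seq)     # Equação (1): H(Seq) = T + 1/size(Seq)
        heapq.heappush(self.heap, (h, seq))

    def substituir(self):
        """Remove a sequência de menor H-value e atualiza o fator temporal T."""
        h, seq = heapq.heappop(self.heap)
        self.T = h              # T passa a ser o H-value da última sequência substituída
        return seq

fila = FilaDeSubstituicao()
fila.adicionar(["t1", "t2", "t3", "t4"])    # sequência longa: H menor
fila.adicionar(["t5"])                      # sequência curta: H maior
print(fila.substituir())                    # a sequência mais longa é substituída primeiro
```

Como T cresce a cada substituição, as sequências adicionadas mais tarde recebem H-values maiores, o que evita manter indefinidamente no cache sequências curtas que não são acessadas há muito tempo.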

O Quadro 14 mostra os resultados das simulações realizadas com a estratégia proposta e outros métodos que consideram a substituição de caches.
The proposed cache replacement strategy for DHCS was simulated and compared with the classic cache
replacement strategies which are used widely, such as FIFO, LFU, LRU, TAIL. The simulation was
performed in a networking simulation environment. Six distributed high-speed cache servers were
connected using a 1,000-Mbps switch to form a fast Ethernet. The DHCS cache capability is the
important factor in cache replacement strategy. The relative size of the cache (RSC) is the ratio of the
cache size to the total size for the tiles requested. Therefore, the RSC was varied in simulations carried
out to assess the performance in terms of cache hit rate, the average request response time.

Cache hit rate


The cache hit rate gives an indication of validity of cache replacement. The cache hit rate is the ratio of
the tiles which cache response to the total size for the tiles requested. Figure 1 shows a comparison chart

of the tile-request cache hit rate of the five strategies for different relative cache sizes. From figure 1, it
shows the cache hit rate of any strategy increases with the cache size. FIFO is the one which always has
the worst cache hit rate; LRU (which embodies temporal locality to tile access) is better than LFU (which
embodies spatial locality to tile access); TAIL (which embodies both temporal and spatial locality to tile
access) is better than FIFO, LFU and LRU. However, SS is a bit better than TAIL, although they all
consider both temporal and spatial locality to tile access. It’s because they have different methods to
embody access-spatial-locality. SS uses tile sequence, in which tiles have space-relativity, to embody
access-spatial-locality, while TAIL uses the accumulated access time. The method in SS follows the spatial
characteristic of tile.

Figura 35. Increased percentage of the tile-request cache hit rate


Fonte: Li et al., 2014, p. 136.

Average request response time


Figure 2 shows the average request response time of any strategy decreases with the cache size. As the
simulation of cache hit rate, FIFO has the worst performance while SS has the best one. TAIL is the
second one and LRU is similar as LFU. Figure 2 is the decreased percentage that SS is compared
with FIFO,LFU,LRU and TAIL. SS is 6% to 14% lower than TAIL, is 7% to 19% lower than
LRU, is 10% to 21% lower than LFU, and is 12% to 23% lower than FIFO. It shows SS has bigger
advantage than other strategies in average request response time. It indicates that SS using the
physical characteristic of cache can accelerate the access response for users.

Figura 36. Decreased percentage of the tile-request average request response time
Fonte: Li et al., 2014, p. 136.

Quadro 14. Resultado das simulações


Fonte: Li et al., 2014, p. 136.

Os autores sumarizam o esquema proposto, listando os principais elementos constituintes, para tornar a estratégia de substituição de cache efetiva.

This paper considers the fixed characteristics of tile access pattern, uses the physical
characteristics of cache burst mode in reading data, generates tile sequence which has
spatiotemporal locality based on a LRU stack, balances the temporal and spatial
locality in tile access pattern by a weighted method, to sort the tile sequences in
replacement queue to make effective cache replacement strategy. (Li et al., 2014, pp.
136-137).


3.2 Desenvolvimentos em ambiente mobile com tecnologia em nuvem


O último desenvolvimento de cache discutido neste estado da arte aborda a necessidade de reduzir o uso da bateria e os custos de armazenamento no dispositivo em aplicações location-based apoiadas na nuvem. Phan, Baek, & Guo (2014) discutem sua solução de cache ao lidar com 1,1 milhão de fotos na nuvem contendo informações geográficas. Afirmam que “To the best of our knowledge, ours is the first work that addresses this problem and offers a general and open solution” (Phan et al., 2014, p. 15).
Os autores vão tratar de serviços de reverse-geocoding a partir de um dispositivo de
telefone inteligente; a explicação é mantida nas palavras dos autores.

… reverse-geocoding, the process of converting a geographic coordinate, commonly


expressed as a latitude & longitude pair, into a real-world physical location. A
smartphone can obtain its geocoordinate either through GPS or cellular/Wi-Fi
trilateration, and then reverse-geocoding will perform the additional step of providing
a human-readable name for that geocoordinate, often at multiple resolutions. For
example, the geocoordinate of 37.79522 latitude & -122.40296 longitude, both in units
of decimal degrees, can be reverse-geocoded to the physical location of “600
Montgomery Street, Financial District, San Francisco, California, United States.”
(Phan et al., 2014, p. 15).

Tal processo facilita, por exemplo, serviços que usem a localização real do usuário via dispositivo móvel, como o álbum de fotos do telefone que organiza as imagens por local ou um guia turístico digital que informe ao usuário a proximidade de algum marco histórico (Phan et al., 2014, p. 15).
O GPS é usualmente apontado como o maior consumidor de bateria nesse tipo de aplicação. Como os autores lidam com isso é mantido em suas palavras.

Reducing power used by location-based applications on constrained mobile


devices is an important topic that has been discussed in other work. GPS is the usual
culprit, and other researchers have looked for solutions, such as inferring user behavior
from lower-power sensors like the accelerometer to intelligently trigger GPS (e.g. [5,
23]), observing user speed (e.g. [13, 7]), and offloading functionality to servers (e.g.
[14]). Our work is complementary; instead of addressing GPS, we identify a higher
battery-cost operation, reverse-geocoding, that can be called repeatedly by
applications, and we apply our effort at reducing this consumption. Other work has
looked at data-mining interesting locations obtained from GPS traces (e.g. [22, 3])


using offline desktop computation. Here, we use the user’s geolocality to perform
caching performed entirely on the smartphone.
We look to partition a metric space into regions, a goal similar to machine
learning classifiers such as SVM or Decision Trees. Our feature space is very small
(only two dimensions), reducing the need for a sophisticated classifier on a battery- and
memory-constrained phone. (Phan et al., 2014, p. 16).

Os autores fazem algumas considerações sobre as dificuldades, em termos de bateria, de o dispositivo móvel lidar com reverse-geocoding, que usualmente exige conexão com a nuvem. Os autores citam alguns exemplos de sistemas já existentes.

While real-world locations can be a great benefit, performing reverse-geocoding


comes at high cost on smartphone platforms, where applications must send a
geocoordinate to a cloud server to get the reverse-geocoded result. On some cellular
networks, making the invocation can consume much battery power. For example, using
a Samsung Galaxy S IV smartphone (released in April 2013) on AT&T’s 4G LTE net-
work outdoors in the San Francisco Bay Area, we measured a reverse-geocoding call
to consume 710 mW, while using GPS to get a geocoordinate – a battery-hungry process
in itself – consumed less at 391 mW. As a result, for scenarios where knowing the
physical location many times during the day would be useful, high battery usage makes
repeated reverse-geocoding server calls prohibitively expensive.
An alternative would be to run the same reverse-geocoding process on the
smartphone, an approach that requires sufficient cartographic data to be stored in order
to perform accurate lookup [21]. Proprietary commercial applications such as those
from Garmin and TomTom allow the user to purchase and install maps for offline usage,
consuming 100s of MBs to several GBs of storage. (Phan et al., 2014, p. 15).

Os autores vão propor, então, uma abordagem para caching de localidades obtidas por reverse-geocoding. Para isso, avaliam três esquemas: convex hulls, limites radiais e hashing cartográfico. O desenvolvimento é testado na organização de fotos em um dispositivo móvel, usando como fonte o serviço Flickr.com.

First, we exploit user geolocality by caching reverse-geocoded locations


obtained from server calls. The key challenge is then to define and manage reverse-
geocoded geospatial regions on the phone in a manner that enables high accurate hit
rates. In our work we evaluated three schemes to achieve that purpose: convex hulls,
radial bounds, and cartographic hashing. Second, our system can propagate this data

by preemptively filling the user’s cache with precomputed regions that are pushed to
phones.
We evaluated our system in the context of mobile photography, where reverse-
geocoding can help organize photos. We used Flickr.com’s public API and downloaded
metadata of over 1.1 million geotagged and timestamped photos that were taken
specifically with smartphones and conducted experimentation to determine the accurate
cache hit rate, a measure that estimates the joint probability of hits in the cache that are
also ground-truth correct. Our results show that for the photography-oriented user
traces obtained, our system reduces the number of cloud server calls by over 70% for
neighborhood granularity and by over 85% for city granularity. Finally, we show that
the scheme offers substantial smartphone local storage as well, with our system
occupying less than 1 MB of storage using our hash encoding even for a complete
precomputation of the San Francisco Bay Area. (Phan et al., 2014, pp. 15-16).

O Quadro 15 apresenta informações que permitem compreender o processo de reverse-geocoding, suas terminologias e seu funcionamento em aplicativos existentes. O quadro dá base para se entender o desenvolvimento da técnica de caching proposta. Também são feitas algumas considerações sobre indexação de dados espaciais.

Geocoding converts a human-readable query for a street address or landmark into a latitude &
longitude coordinate. The process involves canonicalization of the query, searching through an index
of ground-truth locations, and producing the associated geocoordinate. In the absence of an exact
match to any known location, interpolation is performed to estimate the geocoordinate.

Reverse-geocoding naturally performs the opposite function. Given a geocoordinate as input, its
output is a named location at multiple resolutions. For example, the result can describe a location by
its name, street address, neighborhood, county, city, state, and country.

In this paper we focus only on reverse-geocoding, which we distinguish from mapping, the process
of projecting a geo-coordinate onto a graphical map. We further concentrate on reverse-geocoding
to neighborhoods and cities/towns (although we will extend our work to street-granularity in the
future) because it already enables many application scenarios that do not have a need for mapping or
routing at a street level. Instead, such applications rely only on named locations and thus may make
repeated reverse-geocoding requests throughout the day, such as:
• Audible guidance for the blind: Assistive applications can vocalize human-understandable
location names for visually-impaired users living in a city (e.g. [2]).
• Tourism: Visitors who are within the vicinity of a landmark can be informed what they are
near (e.g. [4]).
• Life-logging: Applications can perform continuous life-logging into a diary so users can
review what neighborhoods and towns they have been through (e.g. [8]).

On-device reverse-geocoding challenges


To perform reverse-geocoding effectively, ground-truth spatial data must be collected accurately and

exhaustively to describe physical locations. Spatial indexes such as R-Trees [11, 21] are then built to
enable logarithmic search time. Companies providing mapping data, such as Navteq (owned by Nokia),
Tele Atlas (owned by TomTom), and Google, invest heavily in street-level mapping, and because
mapping is considered a killer feature for modern smartphones, collected data is kept proprietary for
competitive advantage.

The nature of the data and the spatial index provides insight into why offline reverse-geocoding is
problematic. First, and obviously, accuracy is impacted by the amount of data stored. If a location is
missing from the index, then a search must settle for either nearby objects or a coarser-grained
description. Further, unlike geocoding where the resulting answer is a geocoordinate with continuous
numbers, missing real-world locations cannot be as easily interpolated. As a result, providing accurate
reverse-geocoding results improves with more data. Since phones can be space-constrained, keeping a
finely-detailed index is challenging.

We note that at the time of this writing, some commercial (non-free) offline mapping applications are
available for iOS, Android, and Windows Phone, including those from Nokia, Garmin, and TomTom.
The offline map data can span from 100s of MB to several GB depending on the region. It is important
to note that while some mapping applications listed in the app stores may state that the software is free-
of-cost and occupies space on the order of 25 MB, the user must purchase and download the map data
separately. This approach does not exploit geolocality of a user who very rarely leaves his metropolitan
region, and as a result may unnecessarily consume phone storage. Also, some applications sell the map
data as a subscription, incurring ongoing charges. These storage and monetary costs inform our decision
to find a non-proprietary, OS-agnostic, and low-storage solution.

Finally, updating the index may produce inefficient trees. Because cartographic data may change often
due to urban construction, the index structure must change as well, leading to poorer performance
compared to indexes built from scratch. For this reason and others, updates to offline mapping
applications almost always require that the user download entirely new data, which incurs a non-
negligible delay.

Cloud-assisted reverse-geocoding challenges


Given the space requirements for detailed cartographic data and the need to perform updating of the
spatial indexes, mapping and reverse-geocoding services are most commonly accessed through
online cloud services that can leverage the availability of TB-sized data stores (for example, Earth
data provided by the open-source OpenStreetMap is about 330 GB in raw, unindexed form [15, 17]).
On Android and iOS, there exist native API for performing reverse-geocoding by calling cloud servers,
and in the absence of such dedicated API, programmers can still perform reverse-geocoding through
RESTful HTTP calls to Google[10], OpenStreetMap Nominatim [16], and other services.

The resulting problem is that these online calls are costly in terms of battery power. For example, in our
photography context, suppose the user is out taking photos, such as on a weekend or vacation trip, and
wants to view his grouped albums several times during the day; invoking a server-side process for
reverse-geocoding lookup of one or a few photos at a time would then incur increased power
consumption. We quantified this expenditure using a Samsung Galaxy S IV smartphone, where we
measured reverse-geocoding invocations to Nominatim over 4G LTE to consume, on average, twice the
power of GPS. While battery consumption due to wireless data usage is an issue for any cloud-
connected mobile app, it is especially problematic for the types of applications we consider since by
their nature, the user is often out of the range of Wi-Fi and still needs to make repeated reverse-
geocoding requests during the day.
In addition to battery use, other potential problems are:
• Calling the server incurs a wireless hop delay and a server processing delay.
• Using cellular data will incur wireless provider monetary charges, depending on the user’s

data plan.
• The reverse-geocoding service provider may enforce a quota limit that repeated requests may
exceed.

Finally, from the viewpoint of the reverse-geocoding service provider, fulfilling requests from
potentially thousands or millions of mobile app users each day can be burdensome in terms of CPU
load.

Quadro 15. Reverse-Geocoding


Fonte: Phan et al., 2014, pp. 16-17.

Os autores discutem, então, a abordagem proposta, ressaltando seu principal objetivo: minimizar o armazenamento no dispositivo móvel e os acessos ao servidor, de forma a consumir o mínimo possível de bateria nessa transmissão de dados. A Figura 37 esquematiza a estratégia de caching proposta a partir de dois aspectos das informações geográficas, conforme apresentado pelos autores no Quadro 16.

• We perform caching of geospatial regions whose ground-truth location names are obtained
from reverse-geocoding servers (running OpenStreetMap Nominatim). We implemented three
schemes and ran our caching on Android. Our system is described in subsection 4.2.

• We also explore the opposite end of the communication versus storage tradeoff: instead of
lazily storing visited regions, we can precompute hashed boundaries and preemptively push them to
the phone. This preemptive propagation is described in subsection 4.3.


Figura 37. Caching algorithm decision flow


Fonte: Phan et al., 2014, p. 18.

In our Android implementation, the cache is maintained mostly in memory with a SQLite
backing store. The cache operates using the logical flow in Figure 2. After an app acquires a
geocoordinate (e.g. via GPS), it can call our component to get the location name. Our system then
checks to see if the geocoordinate is in a cached region that has already been ground-truth labelled
with a name. If there is a hit, then the location name is returned to the application, but note that this
location is inferred from previously-cached data, where the accuracy of the inference is affected by
the region definition scheme, as discussed below.

Box 16. The caching strategy


Source: Phan et al., 2014, pp. 17-18.
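
A minimal sketch of this decision flow, assuming a hypothetical hash_region function that maps a coordinate to its cached-region key and a reverse_geocode server call such as the one sketched earlier; it is not the authors' implementation, which additionally persists the cache in a SQLite backing store.

    class LocationCache:
        # Illustrative in-memory cache mapping a region key to a ground-truth location name.
        def __init__(self, hash_region, reverse_geocode):
            self.hash_region = hash_region          # region definition scheme (assumed callable)
            self.reverse_geocode = reverse_geocode  # server call, performed only on a miss
            self.regions = {}

        def lookup(self, lat, lon):
            key = self.hash_region(lat, lon)
            name = self.regions.get(key)
            if name is not None:
                return name                          # cache hit: name inferred from a cached region
            name = self.reverse_geocode(lat, lon)    # cache miss: ground-truth server lookup
            self.regions[key] = name                 # label the region for future queries
            return name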

As to how the system contacts the server to identify the geographic information and add it to the cache, the explanation is kept in the authors' own words.

If there is instead a cache miss, then the system makes a call to a reverse-geocoding
server to get a ground-truth location name. In our work we use OpenStreetMap data
and the Nominatim reverse-geocoding service that we installed on our own Linux
servers. After reverse-geocoding completes, the geocoordinate and its labelled name
are added to the cache. A separate table maintains the name strings; with 1000 unique
location names taking on average 40 2-byte characters, the lookup table will consume
80 kB. The name is then finally returned to the application. (Phan et al., 2014, p. 18).

The formalization of the caching, with its equations, is presented in Box 17, which keeps the authors' explanations verbatim.

Figure 38. Screenshot of the caching equations (1)–(3)


Source: Phan et al., 2014, p. 18.

Equation (1) is the cache hit rate, a traditional measurement for caching strategies. Here, it is
the proportion of reverse-geocoding requests that can be answered (correctly or incorrectly) by the
cache without having to call a server. In our case, although a geocoordinate may be in a cached region,
that matching region may be labelled with a different name than where the geocoordinate actually is.
For example, suppose the neighborhood of Chelsea, New York City, is cached but with a region
defined to be the Earth. A subsequent request for a California geocoordinate would result in a cache
hit, but the inferred Chelsea location would be inaccurate. We use Equation (2) to define this accuracy.

Inaccuracy can be assuaged by two means. First, the user or developer can set the cached
region to be finer, but this adjustment may decrease both the cache hit rate and the accurate cache hit
rate, discussed below. Second, if a user can manually identify a mislabeled location, then the system
can take the same steps as it would for a cache miss, retrieving a ground-truth name from the cloud,
re-labelling the cached geocoordinate, and returning the name to the application. Importantly, we
always assume the worst case where a server call is made for all inaccurately inferred results.

Equation (3) is the accurate cache hit rate, the joint probability that there is a cache hit and an accurate result. This metric is extremely informative because it describes the proportion of all reverse-geocoding requests that can be accurately resolved in the cache without having to call a server. If we assume that in the absence of the cache N reverse-geocoding server calls need to be performed, then we can state that a caching system with an accurate cache hit rate P will invoke the server N(1.0 − P) times.

The key challenge to performing caching of location names is defining and managing the
cached regions in order to enable a high accurate cache hit rate.

Box 17. The caching formalization


Source: Phan et al., 2014, p. 18.
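
Since the screenshot with the equations is not reproduced here, the following is a plausible reconstruction based only on the authors' textual description; the notation is ours and may differ from the paper's.

\begin{align*}
\Pr(\mathrm{hit}) &= \frac{\text{requests answered from the cache}}{\text{total reverse-geocoding requests}} && (1)\\[4pt]
\Pr(\mathrm{correct}\mid\mathrm{hit}) &= \frac{\text{cache hits whose inferred name is the ground-truth name}}{\text{cache hits}} && (2)\\[4pt]
\Pr(\mathrm{correct\ hit}) &= \Pr(\mathrm{hit})\cdot\Pr(\mathrm{correct}\mid\mathrm{hit}) && (3)
\end{align*}

Under the authors' worst-case assumption that every inaccurate hit triggers a server call, a workload of N requests served with accurate cache hit rate P requires roughly N(1 − P) server invocations.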


The authors then present and discuss the schemes that implement cache management. Box 18 presents the convex hull strategy, which relies on R-tree indexing, as well as the radial bounds and cartographic hashing strategies.

Consider an application that asks for reverse-geocoding for three geocoordinates, shown as
points 1, 2, and 3 on the left-hand side of Figure 3. All three produce a cache miss, resulting in a
cloud-assisted reverse-geocoding call. If all three points share the same location, then we can form a
bounding convex hull around them and assign the hull that location name. Suppose the app then
requests a name for the query point Q that lies within the hull; the query is thus a cache hit, with the
point being labelled with hull’s inferred location. Note that since Q is inside the hull, it is not added
to the cache. Further, suppose that the application acquires another geocoordinate, shown as point 4.
Since this point lies on the outside of the hull, it is a cache miss and thus requires a cloud-assisted
lookup. Now, if this point happens to have the same location name as the other points 1, 2, and 3,
then the convex hull can be extended to cover point 4.
To efficiently find covering hulls, we place hull points in an R-Tree index. For H hulls with
M points on average, this approach runs in O(lg(HM ) + M ). An advantage is that potentially few
points are needed to define a large region.

Figure 39. The three geolocation caching schemes we evaluate: convex hull (left); radial bounds (center); and Cartographic Sparse Hashing (right).
Source: Phan et al., 2014, p. 18.

Our next region definition and management scheme, shown in the center of Figure 3, is based
on a query point, a fixed radius r, and the resulting bounding circle that can be formed around the
point. Suppose initially the application has requested points 1 and 2, found them to be cache misses,
and added them and their ground-truth locations into the cache. Now, for a new query geocoordinate
at query point Q, the cache is searched to find the nearest point within distance r to Q. We use an R-
tree to store the points, and to compute distances, we use a Haversine estimation for two
geocoordinates upon a spherical surface (where the sphere has been modeled with the dimensions of
Earth).

In this example, the result is point 1, and so point 1’s location name is assigned to the
query point Q. Note that Q is not added to the cache since its city location was inferred, not obtained
through a ground-truth reverse geolocation lookup. If there are N points kept in an index, then this
nearest-neighbor query can be answered in O (lg N ) on average. The disadvantage is that in order to
achieve a high accurate hit rate, many points must be kept in the cache.


Our third scheme uses the implicit formation of square boundaries formed throughout the
geocoordinate space, as shown on the right of Figure 3. Here, the query point Q is hashed to the same
implicit square as point 1, which was already in the cache, and so Q is given the same location label.
Note that Q is not added to the cache.

Each square is formed through Cartographic Sparse Hashing (CASH), our hash algorithm
that takes as input (i) latitude and longitude as 64-bit floats and (ii) a resolution in meters. It then
outputs a 64-bit integer, where the hashed components for the longitude and latitude end up in the
high and low bits, respectively. (Due to space constraints, we defer detailed code for another paper.)
This final hash key is then used in lieu of geocoordinates in the cache. To better compare this scheme with radial bounds, we define a square boundary based on a radius r that describes a circumscribed circle around a square whose side is L = √2·r. We use this value of L as the resolution given to the hashing function.

The advantage of this scheme is that cache search is O(1), but like the radial bounds scheme,
it suffers from needing many cached points to achieve a high accurate hit rate.

Box 18. Convex hull, radial bounds and cartographic hashing


Source: Phan et al., 2014, pp. 17-18.
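
The authors defer the detailed CASH code to another paper, so the sketch below is not their algorithm. It only illustrates, under our own assumptions (metres-per-degree constant, bit layout), the two ingredients discussed above: a Haversine distance over a spherical Earth model, as used by the radial-bounds scheme, and a generic sparse grid hash that packs quantised longitude and latitude cell indices into a single 64-bit key, longitude in the high bits.

    import math

    EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius for the spherical model

    def haversine_m(lat1, lon1, lat2, lon2):
        # Great-circle distance in metres between two geocoordinates.
        p1, p2 = math.radians(lat1), math.radians(lat2)
        dphi, dlmb = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
        a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
        return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

    def grid_hash(lat, lon, resolution_m):
        # Quantise latitude/longitude into square cells of roughly resolution_m metres
        # and pack the two cell indices into one 64-bit integer (longitude high, latitude low).
        cell_deg = resolution_m / 111_320.0          # approx. metres per degree of latitude
        lat_idx = int((lat + 90.0) / cell_deg) & 0xFFFFFFFF
        lon_idx = int((lon + 180.0) / cell_deg) & 0xFFFFFFFF
        return (lon_idx << 32) | lat_idx

A real implementation would also account for longitude degrees shrinking away from the equator; the sketch ignores this for brevity, which is what makes the O(1) dictionary lookup on the hash key possible.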

The approach to reducing the storage required for caching is illustrated in Figure 40 and described below in the authors' words.

One of our goals was to reduce on-device storage by keeping a small set of
cached regions. Here, we consider what advantages could be gained by relaxing this
requirement by first offline precomputing regions for an entire metropolitan area and
then pushing the results to pre-fill location caches. Such an approach could improve the
hit rate Pr(hit) while not adversely affecting the accuracy Pr(correct hit).
Our approach is illustrated in Figure 4. Consider an arbitrary geometry, shown
by the dark outline, that represents a boundary. We subdivided the region into squares
with r = 500 m, took points every 100 meters per linear side, and performed ground-
truth reverse-geocoding at each point. Homogeneous squares have all points with the
same name and are thus labelled with that name; these are the grey squares in the
figure. Heterogeneous squares occur on the borders.
We then packaged the homogeneous regions (comprising their hashed value
using CASH and their geolocation name) and pushed them to phones to pre-fill the
caches. Because heterogeneous regions are not included, geocoordinates falling into
such regions would be a cache miss. For the entire San Francisco Bay Area covering
9043 km2, this approach produced 32,458 homogeneous squares for city granularity.
At 16 bytes per entry, including the name lookup table discussed earlier, the prefilled
cache occupies 587 kBytes. (Phan et al., 2014, p. 19).


Figure 40. Preemptive precomputed hashes


Source: Phan et al., 2014, p. 19.
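
A rough sketch of this pre-fill step, assuming a reverse_geocode call that returns a city-level name and a grid_hash such as the one sketched above; the sampling grid, the square side and the bounding-box handling are simplified relative to the paper.

    def precompute_homogeneous_squares(bbox, reverse_geocode, grid_hash,
                                       square_m=700.0, step_m=100.0):
        # Return {hash_key: name} for squares whose sampled points all share one name;
        # heterogeneous (border) squares are skipped and remain cache misses on the device.
        lat_min, lon_min, lat_max, lon_max = bbox
        deg_per_m = 1.0 / 111_320.0                  # coarse metres-to-degrees conversion
        squares = {}
        lat = lat_min
        while lat < lat_max:
            lon = lon_min
            while lon < lon_max:
                names = {reverse_geocode(lat + i * step_m * deg_per_m,
                                         lon + j * step_m * deg_per_m)
                         for i in range(int(square_m // step_m) + 1)
                         for j in range(int(square_m // step_m) + 1)}
                if len(names) == 1:
                    squares[grid_hash(lat, lon, square_m)] = names.pop()
                lon += square_m * deg_per_m
            lat += square_m * deg_per_m
        return squares

For scale, the 587 kBytes reported above are roughly consistent with 32,458 entries of 16 bytes (about 507 KiB) plus the roughly 80 kB name lookup table mentioned earlier.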

One of the experiments carried out with the caching system on an Android device involved users organising photos captured by the device, grouped by city and surroundings. The algorithm was then applied. Table 1 shows the results in terms of battery consumption; the authors stress that the algorithm behaved as expected. The details are given in Box 19.

We first evaluate our fundamental assumption that repeated reverse-geocoding operations can
be a significant battery drain. As mentioned in Section 3, smartphones commonly perform reverse-geocoding by calling a cloud server. Currently in the USA, commercial wireless ISPs such as AT&T and
Verizon offer 3G and 4G, with connectivity over 4G LTE and 4G HSPA providing the highest
bandwidth.

We thus looked to determine the power consumption of making a RESTful invocation to the
public Nominatim server [16] to perform reverse-geocoding. In our tests we used a commodity
Galaxy S IV smartphone and measured its power use outdoors with a Monsoon Solutions Power
Monitor hardware power meter. We were very careful to deactivate irrelevant background processes, turn off the screen, and keep the CPU awake with an Android wakelock.
Our results are shown in Table 1. Using GPS to acquire a geocoordinate took about 391 mW,
which includes idle CPU power. The power needed for reverse-geocoding over 4G LTE was almost
twice as much at 710 mW, while the slower 4G HSPA consumed on the same order as GPS.

Table 1. Measured power consumption (GPS averaged over 300 seconds; reverse-geocoding averaged over 30 calls).

We can also estimate the consumed energy (in Watt-hours) following a model where power
is integrated over time. Similar to [18], we found that our test phone exhibited a tail power state where
the LTE radio remains consuming power even after a network invocation ends; as a result, each call
kept LTE on for 10.12 seconds on average.

In our work, we found mobile photographers that took over 250 photos in one day, and for life-logging applications that take geolocations every 2 minutes, 576 reverse-geocoding calls would be needed. If we evaluate our power model with 400 such calls per day, we estimate an energy expenditure of 0.79 Wh, or over 8% of our phone’s 9.88 Wh battery. We measured our fully-charged phone to last between 12 and 18 hours on average with active use, so our estimation roughly equates to between 1 and 1.4 hours of battery life. These observations suggest that having mobile apps make repeated calls to online servers for reverse-geocoding can cause a substantial drain, especially since each reverse-geocoding call is preceded by a GPS invocation.

Box 19. Results of the algorithm


Source: Phan et al., 2014, pp. 19-20.
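
One plausible reading of this estimate, assuming the measured 710 mW is sustained for the 10.12 s the LTE radio stays active per call:

\[
E \approx 0.710\,\mathrm{W} \times 10.12\,\mathrm{s} \times 400\ \text{calls} \approx 2.9\,\mathrm{kJ} \approx 0.80\,\mathrm{Wh},
\qquad \frac{0.80\,\mathrm{Wh}}{9.88\,\mathrm{Wh}} \approx 8\%,
\]

which matches the roughly 0.79 Wh and 8% of battery reported by the authors.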

With this, the authors state that they have addressed the storage costs inherent to reverse-geocoding (which converts geographic information such as latitude and longitude into real-world places) on mobile devices, as well as the battery consumption caused by cloud access. As future work, the authors point to caching at finer granularities, such as street-level resolution, and to evaluating battery use over longer periods, such as sequences of days.

In this paper we reduced these costs by exploiting user geolocality and caching reverse-
geocoded locations that were obtained from server calls. Using a 1.1M photo dataset,
we showed that our cartographic sparse hashing scheme reduces the number of cloud
server calls by over 70% for neighborhood granularity and by over 85% for city
granularity. We additionally explored pre-filling user caches with hash results, which
improved the caching hit rate. Finally, we showed that the system occupies relatively
little space, with less than 1 MB of data being used by our hash encoding even for a
complete precomputation of the San Francisco Bay Area. (Phan et al., 2014, p. 22).


4 Final Considerations and Future Directions


The main contribution of this state of the art is to gather the building blocks for a discussion of caching for geographic models (also called spatial or georeferenced models, and usually referred to by the acronym GIS).
The works studied address both the web and the mobile environments. These environments pose different challenges; most notably, applications on mobile devices lack the storage capacity, processing power and battery needed to handle geographic data. It is therefore necessary to look to web and desktop work for principles that can guide suitable mobile developments.
This report analysed in detail the algorithms of the main models, examining, beyond the mathematical formalizations, the aspects that similar developments must take into account. The advantages and disadvantages, limits, contributions and strategies of a relevant set of works and authors, when articulated together, in themselves broaden the state of the art on the topic.
This state of the art makes clear the opportunity for Xplore to generate theoretical and empirical knowledge on caching for mobile geographic content: combining caching technologies in a single architecture that can run on mobile devices under limited connectivity. Developments in tourism are still far from addressing this issue, which makes the XPlore development an important contribution to the advancement of knowledge in the area.


5 References

Ferreira, N. S. A. (2002). As pesquisas denominadas "estado da arte". Educação & Sociedade, 23(79), 257–272. Retrieved from http://www.scielo.br/pdf/es/v23n79/10857.pdf
Hasslinger, G., Heikkinen, J., Ntougias, K., Hasslinger, F., & Hohlfeld, O. (2018). Optimum caching versus LRU and LFU: Comparison and combined limited look-ahead strategies. 16th International Symposium on Modeling and Optimization in Mobile, Ad Hoc, and Wireless Networks (WiOpt), Shanghai, 1–6. https://doi.org/10.23919/WIOPT.2018.8362880
Inoue, C. R. (2015). Tipos de revisão de literatura. Retrieved from http://www.fca.unesp.br/Home/Biblioteca/tipos-de-evisao-de-literatura.pdf
Li, R., Wang, X., & Shi, X. (2014). A replacement strategy for a distributed caching system based on the spatiotemporal access pattern of geospatial data. International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences - ISPRS Archives, 40(4), 133–137. https://doi.org/10.5194/isprsarchives-XL-4-133-2014
Phan, T., Baek, A., & Guo, Z. (2014). Reducing the cloud cost of mobile reverse-geocoding. 15–22. https://doi.org/10.1145/2609908.2609947
Shukla, S., & Abouzeid, A. A. (2017). Proactive retention aware caching. IEEE INFOCOM 2017 - IEEE Conference on Computer Communications, Atlanta, GA, 1–9. https://doi.org/10.1109/INFOCOM.2017.8057029
Zhong, Y., Fang, J., & Zhao, X. (2013). VegaCache: Efficient and progressive spatio-temporal data caching scheme for online geospatial applications. International Conference on Geoinformatics, 1–7. https://doi.org/10.1109/Geoinformatics.2013.6626103
