Mawazo

Mostly technology with occasional sprinkling of other random thoughts

Counting Unique Mobile App Users with HyperLogLog
Posted on November 16, 2014

Continuing along the theme of real-time analytics with approximate algorithms, the focus this time is approximate cardinality estimation. To put the ideas in context, the use case we will be working with is counting the number of unique users of a mobile app. Analyzing the trend of such unique counts reveals valuable insights into the popularity of an app.
We will be using the HyperLogLog algorithm, which is available as a Java API in my open source project hoidla. The Storm implementation of the use case is available in my other open source web analytics project, visitante.

HyperLogLog
Cardinality is the number of unique items in a list. In a naive implementation, cardinality can be computed exactly using memory proportional to the cardinality itself. However, when the cardinality is very high (e.g., IP addresses, phone numbers), such a naive approach is not pragmatic. Various approximate probabilistic algorithms can be used for cardinality estimation instead.
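To make the memory cost of the naive approach concrete, here is a minimal sketch of exact counting with a hash set. The class name `ExactUniqueCounter` is hypothetical and is not part of hoidla; the point is only that every distinct item has to be retained.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration: exact unique counting.
// Memory grows linearly with the number of distinct items seen.
final class ExactUniqueCounter {
    private final Set<String> seen = new HashSet<>();

    // Adding an item that is already present leaves the set unchanged.
    void add(String item) {
        seen.add(item);
    }

    int count() {
        return seen.size();
    }
}
```

With hundreds of millions of phone numbers, the set itself becomes the bottleneck, which is the motivation for the probabilistic approach.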
HyperLogLog is based on analyzing bit patterns in the hashed value of an item. We look at the length of the run of zero bits at the most significant end of the hash, i.e., the leading zeros. The maximum length among such zero bit sequences from all the hashed values is indicative of the unique item count.
In reality, to improve the quality of the result, multiple independent hash functions are used, and the longest sequence of zero bits resulting from each hash function is used to produce the final averaged unique count value.
Instead of using multiple hash functions, we will use a technique called stochastic averaging. We set aside a sequence of the most significant bits of the hash to select a bucket. From the remaining bits we find the sequence of most significant zero bits. For example, with a 32-bit hash and 256 buckets, we use the most significant 8 bits to select the bucket and the remaining 24 bits to find the sequence of most significant zero bits. For each bucket, we maintain the maximum length of the sequence of zero bits seen so far.
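The bucket scheme described above can be sketched in a few lines of Java. This is an illustration under stated assumptions, not the hoidla implementation: the class name `HyperLogLogSketch`, the hash function (FNV-1a followed by a murmur3-style finalizer), and the fixed 256-bucket layout are all choices made for this example. Following the original paper, each bucket stores the rank (leading zeros plus one) rather than the raw zero count, and the raw estimate is the bias-corrected harmonic mean across buckets. The small-cardinality correction is deliberately left out here.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of HyperLogLog with stochastic averaging (32-bit hash, 256 buckets).
final class HyperLogLogSketch {
    private static final int BUCKET_BITS = 8;                 // top 8 bits select the bucket
    private static final int NUM_BUCKETS = 1 << BUCKET_BITS;  // 256 buckets
    private static final int VALUE_BITS = 32 - BUCKET_BITS;   // remaining 24 bits

    // Per bucket: the maximum rank seen, where rank = leading zeros of the 24-bit remainder + 1.
    private final byte[] registers = new byte[NUM_BUCKETS];

    void add(String item) {
        int h = hash32(item);
        int bucket = h >>> VALUE_BITS;
        int rest = h & ((1 << VALUE_BITS) - 1);
        // numberOfLeadingZeros works on 32 bits, so subtract the 8 bucket bits.
        int rank = Integer.numberOfLeadingZeros(rest) - BUCKET_BITS + 1;
        if (rank > registers[bucket]) {
            registers[bucket] = (byte) rank;
        }
    }

    // Raw estimate: alpha * m^2 / sum(2^-register), with alpha as given in the paper for m >= 128.
    double estimate() {
        double sum = 0.0;
        for (byte r : registers) {
            sum += Math.pow(2.0, -r);
        }
        double alpha = 0.7213 / (1.0 + 1.079 / NUM_BUCKETS);
        return alpha * NUM_BUCKETS * NUM_BUCKETS / sum;
    }

    // Assumed hash for this example: FNV-1a, then a murmur3-style finalizer to spread the bits.
    static int hash32(String s) {
        int h = 0x811C9DC5;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xFF);
            h *= 0x01000193;
        }
        h ^= h >>> 16;
        h *= 0x85EBCA6B;
        h ^= h >>> 13;
        h *= 0xC2B2AE35;
        h ^= h >>> 16;
        return h;
    }
}
```

The whole state is 256 bytes regardless of how many items are added. With 256 buckets the expected relative error is roughly 1.04 / sqrt(256), or about 6.5%.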

Small Cardinality
The HyperLogLog algorithm as described in the original paper does not work well for small cardinality. As suggested in the paper, when the unique count falls below a threshold, an algorithm based on the probabilistic properties of random allocations is used instead.
The correction for small cardinality is included in the implementation in hoidla. However, if the cardinality is known a priori to be small, you could do simple hash-based counting instead of HyperLogLog.
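For reference, the small-range correction given in the original HyperLogLog paper is linear counting. The notation here follows the paper and is not taken from this post: m is the number of buckets, E is the raw HyperLogLog estimate, and V is the number of buckets that are still zero. Whether hoidla uses exactly this threshold is not stated here.

```latex
E^{*} =
\begin{cases}
m \ln\!\left(\dfrac{m}{V}\right) & \text{if } E \le \tfrac{5}{2}\,m \text{ and } V > 0, \\[6pt]
E & \text{otherwise.}
\end{cases}
```

For example, with m = 256 buckets of which V = 200 are still empty, the corrected estimate is 256 ln(256/200), or about 63 unique items.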

Mobile App Usage Data
By instrumenting the SDK calls of the app, usage data is created with the following 5 fields. The phone number is used as an identifier for the user.
1. Date
2. Time
3. Session ID
4. Phone area code
5. Phone number

As we will see later, the phone area code is used to partition the data in the Storm implementation. The data is fed to Storm through a message queue. Here is some sample data:
2014-11-16  02:17:48  c080f1fa-6d79-11e4-aa9d-c42c030f8af1  310  (310)6121967
2014-11-16  02:17:50  d0997c05-6d79-11e4-a360-c42c030f8af1  408  (408)4937187
2014-11-16  02:17:50  cbd4af2e-6d79-11e4-b7e1-c42c030f8af1  339  (339)8242149
2014-11-16  02:17:52  d0997c05-6d79-11e4-a360-c42c030f8af1  408  (408)4937187
2014-11-16  02:17:54  cbd4af2e-6d79-11e4-b7e1-c42c030f8af1  339  (339)8242149
2014-11-16  02:17:56  d0997c05-6d79-11e4-a360-c42c030f8af1  408  (408)4937187
2014-11-16  02:17:56  cbd4af2e-6d79-11e4-b7e1-c42c030f8af1  339  (339)8242149
2014-11-16  02:17:57  d5f8956b-6d79-11e4-891d-c42c030f8af1  213  (213)7703334
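As an illustration of how one usage record maps onto the five fields, here is a small parsing sketch. It assumes each record arrives as a single line with whitespace-separated fields; the actual delimiter used in visitante is not shown in this post, and the class name `UsageRecord` is hypothetical.

```java
// Hypothetical holder for one usage record: date, time, session ID, area code, phone number.
final class UsageRecord {
    final String date;
    final String time;
    final String sessionId;
    final String areaCode;
    final String phoneNumber;

    private UsageRecord(String date, String time, String sessionId,
                        String areaCode, String phoneNumber) {
        this.date = date;
        this.time = time;
        this.sessionId = sessionId;
        this.areaCode = areaCode;
        this.phoneNumber = phoneNumber;
    }

    // Assumes whitespace-separated fields; rejects lines without exactly 5 fields.
    static UsageRecord parse(String line) {
        String[] f = line.trim().split("\\s+");
        if (f.length != 5) {
            throw new IllegalArgumentException("expected 5 fields: " + line);
        }
        return new UsageRecord(f[0], f[1], f[2], f[3], f[4]);
    }
}
```

The area code field is what the topology partitions on, and the phone number field is what gets fed to the unique counter.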

Storm Topology
The Storm topology consists of a spout and two bolts. The spout reads usage data from a message queue. A simple message queue abstraction available in my open source project chombo is used. It facilitates usage of any message queue. I have used Redis.
The data emitted by the spout is field grouped on area code, which is essentially hash partitioning on the area code. All the data for the same area code is processed by the same bolt instance of UniqueVisitorCounterBolt. Each bolt instance maintains an instance of a HyperLogLog object. When a new tuple arrives, it is processed by the HyperLogLog object.
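The idea behind field grouping can be modeled as hash partitioning in plain Java. This is a simplified model, not Storm's own code: the class `AreaCodePartitioner` and the use of `String.hashCode()` are assumptions made for illustration.

```java
// Hypothetical model of field grouping: route each area code to a fixed task index.
final class AreaCodePartitioner {
    private final int numTasks;

    AreaCodePartitioner(int numTasks) {
        if (numTasks <= 0) {
            throw new IllegalArgumentException("numTasks must be positive");
        }
        this.numTasks = numTasks;
    }

    // floorMod keeps the index in [0, numTasks) even when hashCode() is negative.
    int taskFor(String areaCode) {
        return Math.floorMod(areaCode.hashCode(), numTasks);
    }
}
```

Because the mapping depends only on the area code, every record for a given area code, and therefore for a given phone number, reaches the same counter instance. Several area codes may share one instance.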
When the UniqueVisitorCounterBolt receives a tick tuple, it obtains the unique count from the HyperLogLog object and emits the tuple (boltID, uniqueCount).
The tuple emitted by UniqueVisitorCounterBolt is processed by the UniqueVisitorAggregatorBolt, of which there is only one instance. As the name suggests, it aggregates the counts, which is simply summing up the unique counts from the predecessor bolt layer. Summing is valid because a phone number belongs to exactly one area code, so each user is counted by exactly one bolt instance. The result is written to a message queue, which any client application can consume for further trend analysis of unique user count data.
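The aggregation step can be sketched as follows. This is an assumed design, not the visitante code: since each counter bolt emits its cumulative count on every tick, the sketch keeps only the latest count per bolt ID and sums those, so repeated emissions are not double counted. The class name `UniqueCountAggregator` is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical aggregator: remembers the latest cumulative count per bolt and sums them.
final class UniqueCountAggregator {
    private final Map<Integer, Long> latestByBolt = new HashMap<>();

    // A newer (boltID, uniqueCount) tuple replaces the previous one from the same bolt.
    void update(int boltId, long uniqueCount) {
        latestByBolt.put(boltId, uniqueCount);
    }

    long total() {
        long sum = 0;
        for (long count : latestByBolt.values()) {
            sum += count;
        }
        return sum;
    }
}
```

For example, if bolt 1 last reported 40 and bolt 2 last reported 36, the total is 76; when bolt 1 later reports 41, the total becomes 77.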
The output is simply the tuple (currentTime, uniqueCount). Here is some sample output. As new records are processed, the unique count grows; it never decreases.
1416191997120  76
1416192007121  76
1416192017123  77
1416192027584  78
1416192037125  78
1416192047126  78
1416192057126  78
1416192067128  79
1416192077128  79

Temporal Reference
The unique count has a temporal reference point for time series data. The counting is with respect to some point in the past. In the case of a mobile app, it will be the launch date of the app.
Although the temporal reference generally does not change once set, sometimes it may be necessary to change it. There is a mechanism to clear the counter and start counting with a clean slate.
A simple publish-subscribe mechanism is used to dispatch commands to bolt instances. A simple pub-sub interface is available in chombo, with implementations for different messaging providers. I have used Redis. On receipt of a tick tuple, the UniqueVisitorCounterBolt fetches the command, if any, from the pub-sub system. If a clear command is found, the HyperLogLog counter is cleared.

Summing Up
We have gone through an exercise of using a probabilistic counting algorithm for approximate unique count estimation. The HyperLogLog algorithm has been used for network traffic data analysis and in query planners in databases. Step-by-step instructions to run this use case are available in this tutorial document.

About Pranab
I am Pranab Ghosh, a software professional in the San Francisco Bay area. I manipulate bits and bytes for the good of living beings and the planet. I have worked with myriad of technologies and platforms in various business domains for early stage startups, large corporations and anything in between. I am an active blogger and open source contributor. I am passionate about technology and green and sustainable living. My technical interest areas are Big Data, Distributed Processing, NOSQL databases, Data Mining and Programming languages. I am fascinated by problems that don't have neat closed form solution.

This entry was posted in Approximate Query, Big Data, Mobile, Real Time Processing, Storm and tagged cardinality, mobile, unique count.
