Spatio-temporal analysis of social media for event identification
P. Arcaini (1), G. Bordogna (2,3), E. Mangioni (3), S. Sterlacchini (3)
(1) Department of Engineering, University of Bergamo
(2) National Research Council of Italy, Institute for Electromagnetic Sensing of the Environment
(3) National Research Council of Italy, Institute for the Dynamic of Environmental Processes
Mining information from social data
- Messages are continuously produced on social platforms
- They can reveal aperiodic or periodic characteristics of events
- They can help us make decisions in:
  - marketplace analysis
  - territorial monitoring
  - ...
Proposed process
r = <(lat, lon), t, message>
Case studies
- Tweets related to traffic jams (two months)
  Keywords: #trafficjam, #stau, #engarrafamento, ...
  Languages: Dutch, English, French, German, Greek, Italian, Japanese, Korean, Portuguese, Russian, Spanish, Thai, Turkish
  Number of retrieved tweets: 225, 521, 20, 105, 4, 92, 15, 19, 83, 134, 369, 161, 118
- 5175 tweets with hashtag #usopen (between the 5th and the 15th of September 2013)
- Others: floods, storms, world cup
GIT, Geology and Information Technology
Clustering
- A clustering algorithm groups a set of objects into clusters
- Elements belonging to a cluster are close (according to a distance measure)
- DBSCAN is a density-based clustering approach:
  - it produces clusters of arbitrary shape
  - it does not require the number of clusters as input
  - it requires two inputs:
    - ε: the maximum accepted distance between an element e and another element of a cluster C, in order for e to belong to C
    - MinPts: the minimum number of elements per cluster
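
The two inputs above can be illustrated with a minimal pure-Python sketch of DBSCAN that takes the distance measure as a parameter. It follows the textbook formulation, in which MinPts is the minimum neighbourhood size (the point itself included) for a point to start or extend a cluster. The names `dbscan`, `NOISE`, and `dist` are illustrative and do not come from the slides, and the quadratic neighbour search is only suitable for small inputs.

```python
from typing import Callable, List, Optional, Sequence, TypeVar

T = TypeVar("T")
NOISE = -1  # label for items that belong to no cluster (illustrative convention)


def dbscan(items: Sequence[T], eps: float, min_pts: int,
           dist: Callable[[T, T], float]) -> List[int]:
    """Return one cluster label per item (0, 1, ...), or NOISE."""
    labels: List[Optional[int]] = [None] * len(items)

    def neighbours(i: int) -> List[int]:
        # Naive O(n) scan; the item itself is part of its own neighbourhood.
        return [j for j in range(len(items)) if dist(items[i], items[j]) <= eps]

    cluster = -1
    for i in range(len(items)):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be absorbed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not dense
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is dense, so the cluster grows
    return [NOISE if label is None else label for label in labels]


# One-dimensional toy data: two dense groups and one isolated value.
values = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 20.0]
labels = dbscan(values, eps=0.15, min_pts=2, dist=lambda a, b: abs(a - b))
print(labels)
```

Because the distance is passed in as a function, the same routine can be reused with a spatial, temporal, or combined distance without changing the clustering logic.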
Definition of distances for DBSCAN
- We use different distances in DBSCAN:
  - spatial distance
  - temporal distance
  - spatio-temporal distance
  - modulo-temporal distance
  - spatio-modulo-temporal distance
- We obtain different kinds of information using different distances
Spatial distance
- It is used to discover reports that have been submitted from nearby places
- We use the Haversine distance:

$$\mathrm{Dist}_G(r_a, r_b) = 2R \arcsin\left(\sqrt{\sin^2\left(\frac{\mathrm{lat}_b - \mathrm{lat}_a}{2}\right) + \cos(\mathrm{lat}_a)\cos(\mathrm{lat}_b)\sin^2\left(\frac{\mathrm{lon}_b - \mathrm{lon}_a}{2}\right)}\right)$$

  where R is the Earth's radius
- Other, more accurate distances: Vincenty's formulae
- Network distances could also be used: Distance Matrix Service of the Google Maps JavaScript API
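
As a worked illustration of the Haversine distance, the sketch below computes the great-circle distance in kilometres between two (lat, lon) pairs given in degrees. The function name `haversine_km` and the mean Earth radius of 6371 km are assumptions made for this example, not values taken from the slides.

```python
import math

# Mean Earth radius in km: an assumed constant, not stated in the slides.
EARTH_RADIUS_KM = 6371.0


def haversine_km(lat_a: float, lon_a: float, lat_b: float, lon_b: float,
                 radius_km: float = EARTH_RADIUS_KM) -> float:
    """Great-circle distance between two points given in decimal degrees."""
    phi_a = math.radians(lat_a)
    phi_b = math.radians(lat_b)
    d_phi = phi_b - phi_a
    d_lambda = math.radians(lon_b - lon_a)
    h = (math.sin(d_phi / 2) ** 2
         + math.cos(phi_a) * math.cos(phi_b) * math.sin(d_lambda / 2) ** 2)
    # Clamp to 1.0 to guard against rounding slightly above the asin domain.
    return 2 * radius_km * math.asin(math.sqrt(min(1.0, h)))


# One degree of latitude along a meridian, roughly 111 km.
one_degree = haversine_km(0.0, 0.0, 1.0, 0.0)
print(round(one_degree, 2))
```

The formula treats the Earth as a sphere, which is why ellipsoidal methods such as Vincenty's formulae are more accurate over long distances.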
Spatial clustering: traffic jam, World (ε_s = 0.5 km, MinPts = 50)
Spatial clustering: traffic jam, Bangkok (ε_s = 0.5 km, MinPts = 3)
Temporal distance
- It is used to discover reports that have been submitted at the same or at close points in time
- All dates must be converted to a given time zone TZ
- The temporal distance is defined as follows:

$$\mathrm{dist}_t(r_a, r_b, TZ) = \left|\,\mathrm{changetimezone}(t_a, TZ) - \mathrm{changetimezone}(t_b, TZ)\,\right|$$
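
The definition above can be sketched with Python's standard `datetime` module. The helper names `change_time_zone` and `dist_t` mirror the slide's notation but are otherwise hypothetical, and the example timestamps are invented. Fixed UTC offsets are used so the snippet needs no time zone database.

```python
from datetime import datetime, timedelta, timezone


def change_time_zone(t: datetime, tz: timezone) -> datetime:
    """Express an offset-aware timestamp in the reference time zone tz."""
    if t.tzinfo is None:
        raise ValueError("timestamp must carry its own time zone")
    return t.astimezone(tz)


def dist_t(t_a: datetime, t_b: datetime,
           tz: timezone = timezone.utc) -> timedelta:
    """Absolute time difference after converting both timestamps to tz."""
    return abs(change_time_zone(t_a, tz) - change_time_zone(t_b, tz))


# Two invented reports with different local clocks (UTC-4 and UTC+2).
t_new_york = datetime(2013, 9, 9, 17, 0, tzinfo=timezone(timedelta(hours=-4)))
t_rome = datetime(2013, 9, 9, 23, 5, tzinfo=timezone(timedelta(hours=2)))
gap = dist_t(t_new_york, t_rome)
print(gap)
```

Python already compares offset-aware datetimes on a common timeline, so the explicit conversion mainly documents the step; it becomes essential when timestamps are stored as local wall-clock strings.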
Temporal distance - Transformation to a TZ
Temporal clustering: US Open 2013 (ε_t = 10 min, MinPts = 90)
Temporal clustering: US Open 2013 (ε_t = 10 min, MinPts = 90)
Identified events: men's quarterfinals, women's semifinals, men's semifinals, women's final, men's final
Spatio-temporal clustering: traffic jam, New Delhi (ε_s = 1 km, ε_t = 20 min, MinPts = 2)
Spatio-temporal clustering: traffic jam, Jakarta (ε_s = 1 km, ε_t = 20 min, MinPts = 2)
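
The slides give two thresholds for spatio-temporal clustering (ε_s and ε_t) but do not spell out how they are combined. One plausible reading, sketched below purely as an assumption, is that two reports are neighbours only when they are within ε_s in space and within ε_t in time. The `Report` class, the function names, and the sample coordinates and timestamps are all hypothetical.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Report:
    """A report r = <(lat, lon), t, message>."""
    lat: float
    lon: float
    t: datetime
    message: str


def haversine_km(a: Report, b: Report, radius_km: float = 6371.0) -> float:
    """Great-circle distance in km (6371 km is an assumed mean radius)."""
    phi_a, phi_b = math.radians(a.lat), math.radians(b.lat)
    d_phi = phi_b - phi_a
    d_lambda = math.radians(b.lon - a.lon)
    h = (math.sin(d_phi / 2) ** 2
         + math.cos(phi_a) * math.cos(phi_b) * math.sin(d_lambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(min(1.0, h)))


def are_st_neighbours(a: Report, b: Report,
                      eps_s_km: float, eps_t: timedelta) -> bool:
    """Assumed rule: both the spatial and the temporal threshold must hold."""
    return haversine_km(a, b) <= eps_s_km and abs(a.t - b.t) <= eps_t


start = datetime(2013, 9, 2, 8, 0, tzinfo=timezone.utc)
base = Report(28.6139, 77.2090, start, "#trafficjam")
near_soon = Report(28.6189, 77.2090, start + timedelta(minutes=10), "#trafficjam")
near_late = Report(28.6139, 77.2090, start + timedelta(minutes=30), "#trafficjam")
far_soon = Report(28.6339, 77.2090, start, "#trafficjam")

for other in (near_soon, near_late, far_soon):
    print(are_st_neighbours(base, other, eps_s_km=1.0, eps_t=timedelta(minutes=20)))
```

A predicate like this would replace the single `distance <= ε` test in the neighbourhood search of DBSCAN, leaving the rest of the algorithm unchanged.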
Modulo-temporal distance
- It is used to discover reports that are close in periodic time (e.g. at the same time of day on different days)
- All dates must be considered without their time zone
- A period G must be fixed (it represents the coarsest time unit useful for our reasoning)
Modulo-temporal distance - Time zone removal and modulo operation
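
The two steps named above (time zone removal, then the modulo operation) can be sketched as follows for a fixed-length period G such as a day or a week. The slides do not give the formula, so the details here are assumptions: the function names are hypothetical, offsets are counted from an arbitrary reference Monday, and the distance wraps around the period boundary so that 23:55 and 00:05 are 10 minutes apart.

```python
from datetime import datetime, timedelta

DAY = timedelta(days=1)
# Arbitrary reference instant (a Monday at midnight), chosen for illustration.
REFERENCE = datetime(1970, 1, 5)


def modulo_time(t: datetime, period: timedelta = DAY) -> timedelta:
    """Offset of the local wall-clock time of t within its period."""
    local = t.replace(tzinfo=None)  # drop the time zone, keep the local clock
    return (local - REFERENCE) % period


def dist_mt(t_a: datetime, t_b: datetime,
            period: timedelta = DAY) -> timedelta:
    """Circular distance between two timestamps modulo the period."""
    d = abs(modulo_time(t_a, period) - modulo_time(t_b, period))
    return min(d, period - d)  # assumed wrap-around at the period boundary


# Different days, similar time of day.
morning_a = datetime(2013, 9, 2, 8, 5)
morning_b = datetime(2013, 10, 17, 8, 15)
print(dist_mt(morning_a, morning_b))

# Either side of midnight.
late = datetime(2013, 9, 2, 23, 55)
early = datetime(2013, 9, 9, 0, 5)
print(dist_mt(late, early))
```

With `period=DAY`, rush-hour reports from different days collapse onto the same time of day, which is what lets a density-based method detect recurring events. Calendar months do not have a fixed length, so they would need a different treatment.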
Modulo-temporal clustering: traffic jam, World (ε_mt = 2 min, modulo = day, MinPts = 100)
Modulo-temporal clustering: traffic jam, Bangkok (ε_mt = 12 min, modulo = day, MinPts = 10)
Spatio-modulo-temporal distance: traffic jam (ε_s = 1 km, ε_mt = 20 min, modulo = day, MinPts = 3)
Conclusions
- We have proposed an approach for clustering social data
- Different distance measures are used for different purposes:
  - (spatio-)temporal distance for identifying aperiodic events
  - (spatio-)modulo-temporal distance for identifying periodic events
- The approach could be applied to other data sources