# Implied Geolocation * Some objects are explicitly geolocated: "Austin, Texas", "Cornell University", the "USS_Constitution". * Some objects are not only geolocated, they are 'places' -- present as well in the geonames dataset. The estimator is as follows: * a best-estimate longitude and latitude * the radius of uncertainty for the point * the likelihood the point is erroneous 12000 krec articles 7000 krec geonames 400 krec dbpedia-geo_coordinates_en.json 87 krec dbpedia-geonames_links.json ### dispatch geolocation estimates along links * Send every neighbor your geoestimate accumulate all neighbors' geoestimates. In this drawing, the vertical bars show implied locations; six reasonably nearby each other and two with large error. | | | | || | | ----+------+-+-------+--++------- // ----+---- // --+----- But of course in some places I _know_ the location | X | | | || | | ----+----X-+-+-------+--++------- // ----+---- // --+----- X `-- actual location Why are the estimates spread from the actual? * intrinsic size of the actual: the graph neighbors of "Texas" are spread over a much larger area than the graph neighbors of "Yee-Haw Junction, FL". * strength of the relationship: for example, this naive model can't tell the difference between "X is located in Y" and "X borders Y" * errors in the relationship: the link might be irrelevant or not explanatory for any reason -- anything from "X has the same area as Virginia" to a hacked page. * multi-modal location: Davey Crockett (TODO: verify) was from XXX to XXX the representative of Tennesee (location #1) to the US Congress in Washington, DC (locaton #2). Upon losing re-election, he famously said "You can all go to hell, I am going to Texas"; he died during the battle of the Alamo. The most robust assignment of a geolocation to "Davey Crockett" would look something like the following cartoon: ____ / \ ------ / \ / \ +-+ | |_____| |____/ \ Tennesee Texas DC So what we're going to do is track two separate types of error: * the likelihood the estimate is drawn from purely irrelevant points * assuming the estimates are relevant, the fuzziness of the implied geolocation. * ?? only use estimates with some strength ?? * For all known points, the number of neighbors that are irrelevant