This research was supported in part by NSF grants EIA-0220562, IIS-0837716, CNS-0821345, HRD-0833093, and IIP-0829576.
TerraFly technology, developed under NSF grant EIA-0220562, allows users to visualize aerial imagery with precise street-name overlays and various other overlays. Users virtually fly over the imagery in a web browser, with no software to install or plug in. Tools include user-friendly geospatial querying, data drill-down, interfaces with real-time data suppliers, demographic analysis, annotation, route dissemination via autopilots, customizable applications, production of aerial atlases, and an application programming interface (API) for web sites.
The TerraFly project has been featured on TV news programs and in the worldwide press, including coverage by the New York Times, USA Today, NPR, and the journals Science and Nature. A FOX News segment was broadcast worldwide in July 2007: n0.cs.fiu.edu/...
TerraFly datasets are listed at:
n1.cs.fiu.edu/terrafly.coverage.htm The 40TB collection includes, among others, 1-meter aerial photography of almost the entire United States and 3-inch to 1-foot full-color recent imagery of major urban areas. The TerraFly vector collection includes 400 million geolocated objects, 50 billion data fields, 40 million polylines, and 120 million polygons, including: all US and Canada roads, the US Census demographic and socioeconomic datasets, 110 million parcels with property lines and ownership data, 15 million records of businesses with company stats and management roles and contacts, 2 million physicians with expertise detail, various public place databases (including the USGS GNIS and NGA GNS), Wikipedia, extensive global environmental data (including daily feeds from NASA and NOAA satellites and the USGS water gauges), and hundreds of other datasets.
Searching Spatial Databases
An increasing number of applications require the efficient execution of nearest neighbor queries constrained by the properties of the spatial objects. Due to the popularity of keyword search, particularly on the Internet, many of these applications allow the user to provide a list of keywords that the spatial objects (henceforth referred to simply as objects) should contain in their description or other attributes. For example, online yellow pages allow users to specify an address and a set of keywords, and return businesses whose descriptions contain these keywords, ordered by their distance to the specified address. As another example, real estate web sites allow users to search for properties with specific keywords in their descriptions and rank them according to their distance from a specified location. We call such queries spatial keyword queries.
A spatial keyword query consists of a query area and a set of keywords. The answer is a list of objects ranked according to a combination of their distance to the query area and the relevance of their text description to the query keywords. A simple yet popular variant, which is used in our running example, is the distance-first spatial keyword query, where objects are ranked by distance and keywords are applied as a conjunctive filter to eliminate objects that do not contain them.
Unfortunately, there is no efficient support for top-k spatial keyword queries, where only a prefix of the results list is required. Instead, current systems use ad hoc combinations of nearest neighbor (NN) and keyword search techniques to tackle the problem. For instance, an R-Tree is used to find the nearest neighbors, and for each neighbor an inverted index is used to check whether the query keywords are contained. We show that such two-phase approaches are inefficient.
The Spatial Keyword Search (SKS) engine, a search engine that executes distance-based top-k queries on spatial databases, has undergone major enhancements in the past months. The enhancements mainly improve the response time of queries over large spatial datasets.
Performance of spatial queries with non-spatial numerical predicates, e.g. list_price<=1000, has been optimized. The optimization allows SKS to process queries in O(1) time in the best case, while maintaining O(N) worst-case time, where N is the number of nodes in the R-Tree data structure. The worst case, however, is rarely hit, since there are typically other (non-numerical) predicates that prune sub-trees. The enhancement augments the R-Tree data structure with numerical ranges [min, max] stored at every non-leaf node and propagated upward: a parent's min is the minimum of its children's min values, and likewise for max, so the root node holds the ranges with the largest coverage. During query execution, the query range, e.g. list_price<=1000, is intersected with each R-Tree node's range, e.g. 2000<=node_list_price<=3000, and only sub-trees with non-empty intersections are visited; the rest are pruned altogether, since we know they contain no objects with values in the query range. The best case occurs when the evaluated node is the root and no further nodes have to be inspected. Experiments on a real estate dataset with more than 350,000 objects showed that the optimized SKS improves response time by an order of magnitude compared to SKS with post-filtering on numerical predicates. A similar approach can be used for date fields, e.g. built_date>=01-JAN-2008, by internally representing them in a sortable year-month-day format (20080101), which allows us to treat them as numeric fields.
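The range-pruning idea above can be sketched as follows. This is a simplified in-memory toy, not the SKS implementation: the node layout and field names are assumptions. It shows how subtree [min, max] ranges are built bottom-up and how a query range such as list_price<=1000 prunes whole subtrees whose ranges do not intersect it.

```python
# Toy sketch of R-Tree numeric-range augmentation (assumed layout, not SKS code).
class Node:
    def __init__(self, children=None, objects=None):
        self.children = children or []   # non-leaf: child Nodes
        self.objects = objects or []     # leaf: (object_id, list_price) pairs
        if self.children:
            # Parent range superimposes the children's ranges.
            self.min = min(c.min for c in self.children)
            self.max = max(c.max for c in self.children)
        else:
            prices = [p for _, p in self.objects]
            self.min, self.max = min(prices), max(prices)

def search(node, lo, hi, out):
    # Prune: the subtree's [min, max] does not intersect the query range,
    # so no object below can satisfy the predicate.
    if node.max < lo or node.min > hi:
        return
    if node.objects:
        out.extend(oid for oid, p in node.objects if lo <= p <= hi)
    for child in node.children:
        search(child, lo, hi, out)
```

For the query list_price<=1000, one would call search(root, 0, 1000, out); a subtree whose range is [2000, 3000] is never visited.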
A columnar approach is used for the signature files in the R-Tree data structure. This allows us to define better signature-file coding schemes, optimized for the nature of each field. The intuition is that signature files get 'polluted' when all of an object's textual information is mixed into a single signature. By keeping individual signature files per field, we increase the chance of pruning sub-trees at early stages of query resolution. For instance, signature files for fields with low cardinality (that is, fields whose number of distinct values is small) can be coded with fewer signature bits while guaranteeing no false positives in the best case. In the trivial case where the field is single-valued, a single bit suffices. Similarly, a field with a small number of distinct values, say 100, needs only a few bytes (13 bytes, at one bit per value) to code them; this is conceptually similar to the bitmap indexes used in relational databases. When the number of distinct values exceeds a certain threshold, say a thousand, the signature-file approach can be used with a reduced signature size, say 50 bytes, and still keep the false-positive rate low, since not all bit patterns are equally likely. Early experimental results on real datasets have shown response-time improvements of at least one order of magnitude, and better results are expected in general. There is ongoing research on this technique to reduce memory requirements and I/O overhead; coding-scheme variations, signature-compression techniques, and storage organizations are being studied. Excessive storage overhead has the disadvantage of decreasing the R-Tree's fanout, thus visiting fewer nodes per I/O, which in turn implies doing more (expensive) random I/O to visit the same number of nodes.
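A minimal illustration of the per-field signature idea for a low-cardinality field, assuming a bitmap-style exact coding: one bit per distinct value, with node signatures OR-ed upward so a zero bit proves a subtree can be pruned. The field names and domain are illustrative, not taken from the SKS schema.

```python
# Toy per-field bitmap signatures for a low-cardinality field (e.g. 'city').
# One bit per distinct value gives an exact coding with no false positives.

def value_bit(domain, value):
    # Position of the value in the known domain determines its bit.
    return 1 << domain.index(value)

def node_signature(child_signatures):
    # A node's signature superimposes (ORs) its children's signatures.
    sig = 0
    for s in child_signatures:
        sig |= s
    return sig

def may_contain(node_sig, query_bit):
    # Zero bit => no object in the subtree has this value: safe to prune.
    return (node_sig & query_bit) != 0
```

With a 100-value domain this uses 100 bits (13 bytes), matching the bitmap-index analogy in the text; for high-cardinality fields a hashed, lossy signature would replace the exact coding.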
Information retrieval inverted files are also being studied to direct free-text searches on objects that:
1. have rare keywords, or
2. are far away from the query location and would incur too many false positives.
This allows us to anticipate when a query may become a runaway by heuristically choosing the best search technique according to the criteria above. The technique, common in web search engines, is in general orthogonal to the traditional R-Tree structures used in spatial settings.
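The inverted-file shortcut might be sketched like this; it is a toy index, and both the rarity threshold and the fallback policy are assumptions for illustration.

```python
# Toy inverted file: keyword -> set of object ids. If every query keyword
# is rare, intersect posting lists directly instead of searching the R-Tree.
from collections import defaultdict

def build_inverted_index(objects):
    index = defaultdict(set)
    for oid, text in objects.items():
        for word in text.lower().split():
            index[word].add(oid)
    return index

def candidates(index, keywords, rarity_threshold=5):
    postings = [index.get(k.lower(), set()) for k in keywords]
    if postings and all(len(p) <= rarity_threshold for p in postings):
        # Rare keywords: the intersection is cheap and small.
        return set.intersection(*postings)
    return None  # common keywords: fall back to the spatial (R-Tree) path
```

The returned candidate set can then be ranked by distance to the query location, avoiding a long R-Tree traversal that would mostly produce false positives.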
The comparison operators '>=' and '<=' have been overloaded for string data types, e.g. status>=A. This is currently implemented using the post-filtering approach: when SKS returns candidate objects, a filter is applied to verify whether each object's field satisfies the condition.
This is not fully optimized, as its worst case is O(N), when all R-Tree nodes have to be visited. An enhancement in this direction is in progress, using optimal string prefixes to prune sub-trees at earlier stages.
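The post-filtering step described above amounts to something like the following sketch; the names are illustrative, and Python's lexicographic string comparison stands in for the overloaded operators.

```python
# Post-filter sketch: verify a string predicate on candidates streamed
# from the spatial index (the unoptimized O(N)-worst-case path).
import operator

OPS = {">=": operator.ge, "<=": operator.le}

def post_filter(candidates, field, op, value):
    cmp = OPS[op]
    for obj in candidates:
        # Lexicographic comparison plays the role of the overloaded operator.
        if cmp(obj[field], value):
            yield obj
```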
The result of this work is a methodology for efficiently executing top-k queries with non-spatial predicates. This methodology proposes the application of filters that prune R-Tree subtrees as early as possible in the query resolution to avoid unfruitful node visits. In particular, the filters are applied in this order:
1. numerical filters
2. field signature filters
3. inverted files filters
4. comparison operators
For example, a spatial query with the non-spatial predicates nbr_beds>=4 AND city=miami AND anyfield=jose+smith will use:
1. numerical filters on the field 'nbr_beds',
2. field signature filters on the field 'city', since 'city' most likely has low cardinality,
3. inverted file filters on the 'anyfield' values, since their combination may be rare, and
4. comparison-operator post-filters on 'nbr_beds'.
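The filter assignment for such a query can be sketched as a toy planner; the cardinality statistics and the dispatch rules are assumptions made for illustration, not the SKS optimizer.

```python
# Toy planner: assign each predicate to a filter stage, then order the
# stages as listed above (numeric, signature, inverted, comparison).

LOW_CARDINALITY = {"city"}   # assumed known from dataset statistics

def plan(predicates):
    order = {"numeric": 0, "signature": 1, "inverted": 2, "comparison": 3}
    stages = []
    for field, op, value in predicates:
        if op in (">=", "<=") and isinstance(value, (int, float)):
            stages.append(("numeric", field))      # prunes subtrees via [min, max]
            stages.append(("comparison", field))   # exact check at the leaves
        elif field in LOW_CARDINALITY:
            stages.append(("signature", field))    # per-field bitmap signature
        else:
            stages.append(("inverted", field))     # rare free-text combination
    return sorted(stages, key=lambda s: order[s[0]])
```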
Spatial Data Dissemination
In preparation for a discussion of thin-client technologies, we first describe the existing thin-client product our team has designed and developed over the past 8 years. TerraFly is a web-based maps and aerials delivery product with highly efficient delivery of Web data to a thin client, namely a Web browser. TerraFly allows the user to simultaneously visualize multiple datasets in synchronized thin-client windows and to drill down and query data, seeing the results in graphic, tabular and textual forms via the TerraFly PointData and GeoQuery mechanisms. The PointData mechanism presents drill-down information about a point. GeoQuery allows ergonomic point-and-click formulation of complex geospatial queries by unskilled users in an HTML client, with efficient computation of results.
The TerraFly team in conjunction with the NASA Regional Applications Center at Florida International University has conducted extensive research on thick and thin clients as well as 2-tier, 3-tier and N-tier architectures in Web mapping application areas.
In the course of our development of TerraFly, we have experimented with several implementation strategies. This experimentation has resulted in our development of the following hierarchy of client types.
Group 2. Java or .NET enabled 'active client' with 'active server' - This class of devices supports the TerraFly '2.5D flight' solution without the need to resort to interactive video streaming. Interactive video streaming puts a considerable load on the server farm; hence, if the client is able to bear most of the computational need for a particular mode of flight, it is desirable to offload computations to the client (even a thin client!), provided the client has idle computational power available and sufficient capacity to perform such computations. We note that since Java and .NET are general-purpose languages with rich libraries, they can easily be deployed on the client and re-use existing server-side Web-mapping algorithms to various degrees, reducing the load on the servers and improving the promptness of the interactive response and the Web-map ergonomic features by tapping into the power of the Java/.NET languages and the computational power available on the client.
Group 3. '3D flight' without any need of interactive video-streaming
We now discuss traffic and server load, fault tolerance, and the flight-prediction and pre-fetching issues that arise with Web-mapping solutions in all the classes defined above.
Application Security: The TerraFly research project employs various data-security techniques targeting specific needs: optimal network bandwidth, availability of computational resources, and desired levels of protection. We have researched and deployed a multi-layered architecture for encryption and secure access that combines access control with high-strength encryption and digital signatures; it is fast and efficient because the access-control resolution dialog involves only small amounts of data to be transferred. According to user-defined priorities, the TerraFly system allows various protection levels for the massive data-transfer streams, such as:
Partial encryption of compressed raster data objects, which makes it computationally difficult to reconstitute the original image because key data elements are encrypted. This approach allows us to reduce the CPU overhead associated with encryption and decryption while still providing some level of data protection. We found this method well suited to base maps and base imagery, which in most cases have rather low confidentiality requirements.
Full encryption with symmetric ciphers (either fast medium-strength or slower high-strength methods), which can be applied to raster base maps as well as to smaller, higher-sensitivity objects (vector map elements with higher confidentiality requirements).
Public-key/private-key encryption for the most confidential data elements, and for deployments whose specific requirements call for public-key/private-key methods only. This is the slowest method and should be used sparingly.
All these levels of encryption and access control can be combined to achieve optimal response times within the limited network bandwidth and computational power available on particular devices and systems. The layered structure of Web maps allows flexible security stratification in the communication channels.
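As an illustration of the partial-encryption idea, the following stand-in sketch encrypts only a fraction of a compressed tile. The SHA-256 counter keystream is used solely because it is in the standard library; a deployment would use a real cipher such as AES, and the 25% fraction is illustrative, not TerraFly's actual scheme.

```python
# Stand-in sketch: partially "encrypt" a compressed raster tile by masking
# only the head of the compressed stream. NOT production cryptography --
# the SHA-256 counter keystream is a placeholder for a real cipher.
import hashlib
import zlib

def keystream(key, n):
    out = bytearray()
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:n])

def partial_encrypt(tile, key, fraction=0.25):
    compressed = zlib.compress(tile)
    n = max(1, int(len(compressed) * fraction))
    head = bytes(a ^ b for a, b in zip(compressed[:n], keystream(key, n)))
    return head + compressed[n:]   # only the head is protected
```

Because the head of the zlib stream is masked, the remainder cannot be decompressed without the key even though most bytes are untouched, which is the CPU-saving trade-off the text describes.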
Performing Complex Queries that Blend Geospatial Queries with Keyword Search: Many applications require finding the objects closest to a specified location that contain a set of keywords. For example, online yellow pages allow users to specify an address and a set of keywords, and return businesses whose descriptions contain these keywords, ordered by their distance to the specified address. The problems of nearest neighbor search and keyword search have been extensively studied separately. However, to the best of our knowledge there is no efficient method to answer spatial keyword queries, that is, queries that specify both a location and a set of keywords.
We have developed algorithms to implement combined spatial and keyword queries. In particular, we use a modification of the R-Tree, called IR^2-Tree (Information Retrieval R-Tree)--described above--which incorporates the principles of Signature Files to efficiently answer top-k spatial keyword queries. We have developed algorithms to maintain the IR^2-Tree and use it to incrementally search for the nearest neighbors that contain the query keywords. Our algorithms have been experimentally and analytically compared to current state-of-the-art methods and have been shown to have superior performance and excellent scalability.
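A much-simplified sketch of the incremental search: best-first traversal ordered by distance, with signature checks to prune subtrees and a final verification step to discard signature false positives. The flat one-level "tree" and hashed signatures are toy assumptions, not the IR^2-Tree implementation.

```python
# Toy best-first k-NN search with signature-file pruning.
import heapq

def sig(words, bits=64):
    # Hashed signature: superimpose one bit per word (may false-positive).
    s = 0
    for w in words:
        s |= 1 << (hash(w) % bits)
    return s

def knn_keyword(root, query_pt, query_words, k):
    qsig = sig(query_words)
    heap = [(0.0, id(root), root)]            # (distance, tiebreak, node)
    results = []
    while heap and len(results) < k:
        dist, _, node = heapq.heappop(heap)
        if node["sig"] & qsig != qsig:
            continue  # signature proves no object below has all keywords
        if "children" in node:
            for c in node["children"]:
                d = ((c["pt"][0] - query_pt[0]) ** 2 +
                     (c["pt"][1] - query_pt[1]) ** 2) ** 0.5
                heapq.heappush(heap, (d, id(c), c))
        elif set(query_words) <= set(node["words"]):  # verify: no false positive
            results.append(node["id"])
    return results
```

Because nodes are popped in distance order, results stream out incrementally: the k-th answer is produced without exhausting the tree, which is the top-k behavior the text describes.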
Fast Main-Memory Resolution of Geospatial Queries: We have introduced, and are working on further improvements to, the S-Tree, a high-performance spatial point indexing structure. The S-Tree is a hybrid structure that uses quad-tree style space decomposition and a hash table for searching. The performance of the data structure is essentially that of a hash table, better than O(log N), where N is the size of the tree. The S-Tree has an easy-to-compute hashing function that improves performance and reduces index storage requirements by eliminating the need to store the quad-tree keys. The hash table allows a search to start from an arbitrary level of the tree, without traversing the many levels close to the root or to the leaves as other algorithms do. The S-Tree is also very compact, allowing large spatial indexes to be stored in main memory, and it is optimized for main-memory index residence.
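In the spirit of that description, a toy structure: quad-tree-style cells addressed through a hash table, so a lookup can start at an arbitrary level without a root-to-leaf traversal. The key function and the fixed unit world extent are assumptions for illustration, not the S-Tree's actual design.

```python
# Toy quad-decomposition index: cells at each level are addressed by an
# easy-to-compute key in a hash table, replacing tree-pointer traversal.

def cell_key(x, y, level, world=1.0):
    # Grid coordinates of (x, y) at the given decomposition level.
    cells = 2 ** level
    return (level, int(x / world * cells), int(y / world * cells))

class SGrid:
    def __init__(self, levels=4):
        self.levels = levels
        self.table = {}                       # hash table replaces tree pointers
    def insert(self, x, y, obj):
        for lv in range(self.levels + 1):     # register the point at every level
            self.table.setdefault(cell_key(x, y, lv), []).append(obj)
    def query_cell(self, x, y, level):
        # Jump straight to any level: a single hash lookup, no traversal.
        return self.table.get(cell_key(x, y, level), [])
```

Storing the key implicitly as (level, grid coordinates) mirrors the text's point that the hashing function removes the need to store explicit quad-tree keys.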
Contributions to Other Disciplines:
The test bed for our research efforts, TerraFly, is likely to contribute to geography and sociology research by making heterogeneous spatial data more readily accessible to scientists, educators, and the public. By enabling those interested in geography and sociology to have a GIS-like interface to a library of imagery and other data that can be georeferenced, scientists will be able to more effectively correlate heterogeneous data sets that have been difficult for the average researcher to bring together. Data sources to be brought together include multi-spectral satellite imagery, aerial photography, demographic data, environmental observations, and many other data sources. Educators and the public will also have access to this data, enabling both formal and informal educational opportunities in geography, environmental science, and sociology.