
How big is OSM? Estimating the size of OpenStreetMap’s history

Sep 21st, 2018 by Sascha Fendrich

In this blog post we report a small study on the size of the OpenStreetMap history data set, which we carried out in the context of developing the OSHDB. We estimated the size of OSM’s history in terms of raw data as it would be represented in a systems programming language such as C or Rust. We were interested in the raw amount of space the data would occupy, i.e., without the overhead of general-purpose data formats like XML and without employing data compression, which would introduce computational overhead.

This size estimate gives us

  1. an estimate of the hardware resources required to store and process the full OSM history;
  2. a baseline against which we can compare our efforts to represent the data more compactly.

Overall Methodology and Results

We first defined systems-level data structures for the OSM data objects changeset, member, nd (node reference), node, relation, tag and way. We calculated the sizes of these data structures and multiplied them by the counts of the corresponding XML elements in the OSM history dump. The results are presented in the following table.

Table 1: Number of XML elements, size of data structures and total data size (reported on the full history planet XML file downloaded on 2018-08-08).

XML element name    data structure size (bytes)    XML element count    OSM history size (bytes)
changeset                    65                       61,149,797            3,974,736,805
member                       13                    2,127,505,107           27,657,566,391
nd                            8                   14,537,932,143          116,303,457,144
node                         57                    7,024,909,604          400,419,847,428
relation                     56                       24,631,930            1,379,388,080
tag                          16                    4,861,971,375           77,791,542,000
way                          50                      957,849,585           47,892,479,250
total size                                                                675,419,017,098
variant                                                                   603,352,497,027
compressed XML                                                            111,691,457,431
uncompressed XML                                                        2,001,921,015,360

In Table 1, one immediately sees that the nodes consume more space than all the rest of the data, with the ways’ node references coming second. Overall, OSM’s history occupies between 600 GB and 700 GB, which is roughly 6 times the size of the compressed XML (or 10 times the .pbf) and about 1/3 of the uncompressed XML. This estimate does not include the concrete string representations of the tags, which we represent by unsigned integer IDs. In our experience, these strings take less than 1% of the total size of the data and are therefore not a major factor in our estimate.
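For reference, the arithmetic behind Table 1 is simply the per-element product of data structure size and element count, summed over all element types. A minimal Rust sketch with the counts hard-coded from the table:

// Sanity check for Table 1:
// total size = sum over elements of (struct size in bytes) * (XML element count)
fn main() {
    let rows: [(&str, u64, u64); 7] = [
        ("changeset", 65, 61_149_797),
        ("member",    13, 2_127_505_107),
        ("nd",         8, 14_537_932_143),
        ("node",      57, 7_024_909_604),
        ("relation",  56, 24_631_930),
        ("tag",       16, 4_861_971_375),
        ("way",       50, 957_849_585),
    ];
    let total: u64 = rows.iter().map(|&(_, size, count)| size * count).sum();
    println!("total size: {} bytes", total); // prints 675419017098
}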

The raw data structures have been designed to closely mirror the XML structure, so there is still some room for space optimizations. We could leave out the redundant user id stored with each node, way and relation, because it is also stored in the corresponding changeset. Further, we could store the visibility bit in the highest bit of the version number instead of representing it as an otherwise unused byte. With these two simple optimizations we would save 9 bytes per node, way and relation, which results in the total size reported as “variant” in Table 1. With the current number of tag keys and tag values, we could also halve the size of the tag data structure, which would save another 38 GB.
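As an illustration of the second optimization, here is a minimal sketch (not the OSHDB code) of folding the visibility flag into the most significant bit of a 64-bit version field:

// Sketch: pack the visibility flag into the otherwise unused highest bit
// of the version number, saving the separate one-byte `visible` field.
const VISIBLE_BIT: u64 = 1u64 << 63;

fn pack_version(version: u64, visible: bool) -> u64 {
    debug_assert!((version & VISIBLE_BIT) == 0); // OSM version numbers stay far below 2^63
    if visible { version | VISIBLE_BIT } else { version }
}

fn unpack_version(packed: u64) -> (u64, bool) {
    (packed & !VISIBLE_BIT, (packed & VISIBLE_BIT) != 0)
}

fn main() {
    let packed = pack_version(42, true);
    assert_eq!(unpack_version(packed), (42, true));
}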

Concerning point (1), the required hardware resources, this result is quite promising. Since the above numbers report only the raw data size, we would have to add some indices to make effective use of the data. However, assuming the indices are smaller than the data itself, everything would still fit into less than 1 TB. In the era of terabyte SSDs this is not prohibitively large.

Technical Details

For the technically interested reader, the remainder of this blog post discusses the detailed layout of the data structures and how we computed the above numbers. Recall that our goal is to estimate the size of OSM’s history if we stored it in plain old data structures. This serves as a baseline for evaluating our efforts to find more compact representations.

Data Model

We describe the data structures in a notation roughly borrowed from Rust, which makes it easy to compute their sizes. That is, we use types like u64, denoting a 64-bit unsigned integer, or i32, denoting a 32-bit signed integer, which consume 8 and 4 bytes respectively. A type in square brackets denotes an array of objects of the given type. The data structures closely follow OSM’s XML format, so there is not much to explain here. The comment in the first line of each structure reports the total size of the data structure, not including the arrays, because the latter are counted separately.

changeset { // 65 bytes
  id: u64,
  creation_time: time64,
  closing_time: time64,
  open: bool,
  user_id: u64,
  min_lon: i32,
  min_lat: i32,
  max_lon: i32,
  max_lat: i32,
  num_changes: u32, // with current limit of 10000, u16 would suffice here
  comments_count: u32,
  tags_count: u64,
  tags: [tag],
}

node { // 57 bytes
  id: u64,
  lat: i32,
  lon: i32,
  version: u64,
  timestamp: time64,
  changeset: u64,
  user_id: u64,
  visible: bool,
  tags_count: u64,
  tags: [tag],
}

member { // 13 bytes
  type: u8,
  reference: u64,
  role: u32,
}

relation { // 56 bytes
  id: u64,
  version: u64,
  timestamp: time64,
  changeset: u64,
  user_id: u64,
  visible: bool,
  members_count: u64,
  members: [member],
  tags_count: u64,
  tags: [tag],
}

tag { // 16 bytes (maybe 2x u32 = 8 bytes are sufficient here)
  key: u64,
  value: u64,
}

way { // 50 bytes
  id: u64,
  version: u64,
  timestamp: time64,
  changeset: u64,
  user_id: u64,
  visible: bool,
  nodes_count: u16,
  nodes: [u64],     // node ids
  tags_count: u64,
  tags: [tag],
}

Computing the Estimates

The first step was to compute the size of the full history planet XML file. The size of the compressed XML file could simply be read from the file system. Because the uncompressed XML file would not fit on the hard disk of the machine we used, we computed the size of the uncompressed XML by running lbzcat history-latest.osm.bz2 | wc -c.

The next step was to process the XML file in order to count the numbers of occurrences for each of the OSM data types. To this end, we implemented a small program that counts the number of occurrences of XML tags in the file. Assuming the file is well-formed, the sum of start tags and empty-element tags corresponds to the number of XML elements. We used lbzcat to uncompress the packed XML file in parallel and piped the result into our program. In order to validate the results and to gain some experience of how suitable different programming approaches are for our purposes, we implemented this program in three different programming languages: Python, Haskell and Rust.

The Python and Haskell implementations exploit the fact that OSM’s XML dump stores each XML tag on a separate line, so counting the occurrences boils down to inspecting the first word of each line. The Rust implementation employs the full-fledged XML pull parser quick_xml, as we wanted to be sure that this line-based shortcut is actually valid.
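To give an impression of how simple the line-based approach is, here is a rough sketch of it in Rust (for consistency with the rest of this post; the actual quick-and-dirty counters were written in Python and Haskell, and this is an illustration rather than the original code). It reads the uncompressed XML from standard input, as produced for example by lbzcat history-latest.osm.bz2, and tallies the first word of every line that looks like a start or empty-element tag:

// Line-based counting sketch: rely on the fact that the XML dump puts
// every tag on its own line and only look at the first word per line.
use std::collections::HashMap;
use std::io::{self, BufRead};

fn main() {
    let mut counts: HashMap<String, u64> = HashMap::new();
    let stdin = io::stdin();
    for line in stdin.lock().lines() {
        let line = line.unwrap();
        if let Some(first) = line.trim_start().split_whitespace().next() {
            // keep start tags and empty-element tags, skip end tags like </node>
            if let Some(name) = first.strip_prefix('<') {
                if !name.starts_with('/') && !name.starts_with('?') {
                    let name = name.trim_end_matches(|c: char| c == '>' || c == '/');
                    *counts.entry(name.to_string()).or_insert(0) += 1;
                }
            }
        }
    }
    for (name, count) in &counts {
        println!("{} {}", name, count);
    }
}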

Performance Considerations

We computed the above numbers on an Intel® Core™ i7-7500U CPU @ 2.70GHz × 4. The Python and Haskell implementations took about 10 hours to process the data, the Rust implementation only 4 hours. This is particularly remarkable because the Rust implementation solves the more complex task of processing the full XML.

We also have to consider that the bz2 compression causes a significant overhead. For instance, the Rust implementation was running at 60% CPU while lbzcat consumed the remaining 340% of the 400% CPU provided by the four logical cores. Limiting lbzcat to three cores made our implementation use even less CPU, which indicates that decompression is the limiting factor here. In fact, merely decompressing the file by running lbzcat history-latest.osm.bz2 > /dev/null took 3 hours. We therefore measured the performance of our implementation on a smaller region that we were able to decompress in advance. Processing the uncompressed XML directly improved performance by 30-40% compared to decompressing on the fly. This roughly matches the observation that our program ran at 100% CPU with a-priori decompression, and would take our 4 hours down to roughly 2.5 hours. By exploiting the fact that there is only a fixed set of possible XML tags, we can speed up our implementation by a factor of 2 on the small data set, which would bring the processing time for the global history dump down to roughly 1.25 hours.
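The factor-2 speedup mentioned above comes from not treating the element names as arbitrary strings. A hypothetical variant of the line-based counter, matching each line against the seven known element names from Table 1 and incrementing plain array counters (a sketch of the idea, not the benchmarked implementation):

// Variant exploiting the fixed set of element names: no hash map,
// just a prefix match against the seven names from Table 1.
use std::io::{self, BufRead};

const NAMES: [&str; 7] = ["changeset", "member", "nd", "node", "relation", "tag", "way"];

fn main() {
    let mut counts = [0u64; 7];
    let stdin = io::stdin();
    for line in stdin.lock().lines() {
        let line = line.unwrap();
        let bytes = line.trim_start().as_bytes();
        if bytes.first() != Some(&b'<') {
            continue;
        }
        for (i, name) in NAMES.iter().enumerate() {
            let n = name.len();
            // match "<name" followed by a non-letter (space, '>' or '/'),
            // so that e.g. "<nd" does not also count as "<node"
            if bytes.len() > n + 1
                && &bytes[1..=n] == name.as_bytes()
                && !bytes[n + 1].is_ascii_alphabetic()
            {
                counts[i] += 1;
                break;
            }
        }
    }
    for (name, count) in NAMES.iter().zip(counts.iter()) {
        println!("{} {}", name, count);
    }
}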

Of course, XML is a less than optimal data format for such a task. Our main reason for working with the XML dump was the ease of implementing the processing. Both the Python and the Haskell program have fewer than 10 lines of code and were written in a few minutes each. With about 20 lines of code, the Rust implementation is slightly more complex, as it really parses the XML. The obvious alternative is the full history planet PBF file, which is designed to be more machine-friendly but involves considerably more work for the programmer. For comparison, we implemented the same counting process for the PBF dump in about 50 lines of C++ using libosmium. The computation took 0.5 hours with the process running at 200% CPU. Apparently, reading the data is still a bottleneck that prevents the process from running at 400% CPU.

The major factors we identified for the performance benefit of the C++ implementation are:

  • The PBF dump is easier on disk IO: 68 GB vs. 2 TB is about 1/30 of the data transfer;
  • The PBF dump is easier on processing: an estimated 700 GB of structured data vs. 2 TB of unstructured text is about 1/3 of the data volume and saves much of the parsing and structuring work;
  • The PBF dump is easier to parallelize, due to the data being sharded.

Considering these factors, the performance of the Rust implementation is not bad.

Tags: ohsome example

Posted in Uncategorized
