Why Is It So Difficult To Parse Addresses?
Precise address data is fundamental to a multitude of services.
The ability to accurately dissect and interpret address components is important for the accurate delivery of mail, managing customer databases, integrating geographic information systems and more.
This blog explores what address parsing is and why it presents such unique challenges.
Discover the intricacies behind making sense of seemingly simple address data and why getting it right is more complicated than it first appears.
TL;DR
Address parsing involves breaking down addresses into their individual components (like street name, city, state, and postcode/ZIP code) to make them understandable for computers.
It’s challenging due to variations in address formats, international differences, ambiguous elements, complex building details, and lack of standardisation.
Despite these difficulties, commercial address parsers achieve high accuracy, and emerging machine learning techniques offer potential for developing custom solutions.
What is Address Parsing?
In essence, address parsing is breaking down and identifying the individual components of an address to make it more understandable and usable for computers. This process ensures that each part of the address is correctly identified, interpreted and standardised for greater accuracy in subsequent applications.
Let’s take a letter that you receive in the mailbox.
On the front, there’s a block of text with your name, street address, city (or suburb or town), state, and postcode (or ZIP code). All these combined tell the postman where to deliver the letter.
Now, let’s say you have a robot assistant, and you want to teach it to understand and organise this information.
You’d instruct the robot to recognise the different parts of the address: This part is the person’s name. This is the street they live on. This part tells us the city, and so on.
Address parsing is like teaching the robot to recognise and separate these individual parts of the address. So, instead of seeing one big block of text, the robot (or computer program) sees the address as different pieces of information:
- name,
- street,
- city,
- state, and
- ZIP code/postcode.
This helps computers and software understand and manage addresses more efficiently, just like how you can easily tell apart the street name from the city when you look at the address on a letter.
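As a rough illustration, the separated pieces could be held in a small Python structure. The class name `ParsedAddress`, its field names and the recipient name are assumptions made for this sketch; they are not taken from any particular parsing product.

```python
from dataclasses import dataclass, asdict


@dataclass
class ParsedAddress:
    """One field per address component (hypothetical field names)."""
    name: str
    street: str
    city: str
    state: str
    postcode: str  # kept as text so any leading zeros survive


# The recipient name is made up purely for illustration.
letter = ParsedAddress(
    name="Jane Citizen",
    street="64 York Street",
    city="Sydney",
    state="NSW",
    postcode="2000",
)

# Each component can now be read on its own instead of as one block of text.
print(asdict(letter))
```

Once the address is in this shape, software can sort by city or validate the postcode without re-reading the whole text block.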
Why is Address Parsing Difficult?
Address parsing is difficult because addresses vary greatly in format and structure, both within and across countries. Ambiguous elements (e.g., “St.” for “Street” or “Saint”), complex building details, misspellings and multiple languages add to the challenge.
Additionally, addresses often change due to renaming or updates, and there is very little standardisation in how people enter addresses.
These factors make it hard to create a parser that can accurately interpret all possible address variations.
These are some examples that demonstrate the complexity involved.
Example 1
Address = 64 YORK STREET SYDNEY NSW 2000
- 64 = Street number,
- YORK = Street name,
- STREET = Street type,
- SYDNEY = Suburb,
- NSW = State,
- 2000 = Postcode
Done. Why do people tell me it is difficult?
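A naive parser for this one layout might look like the sketch below. The function name `parse_simple` and the tiny lookup tables are assumptions for illustration only; a real parser would need far larger reference data.

```python
import re

# Small illustrative lookup tables (assumptions, far from complete).
STREET_TYPES = {"STREET", "ROAD", "AVENUE", "BOULEVARDE"}
STATES = {"NSW", "VIC", "QLD", "SA", "WA", "TAS", "NT", "ACT"}


def parse_simple(address: str) -> dict:
    """Parse '<number> <name> <type> <suburb> <state> <postcode>'."""
    tokens = address.upper().replace(",", " ").split()
    if len(tokens) < 6:
        raise ValueError("too few tokens for this simple layout")

    postcode = tokens.pop()   # last token
    state = tokens.pop()      # second-to-last token
    number = tokens.pop(0)    # first token
    if not re.fullmatch(r"\d{4}", postcode) or state not in STATES:
        raise ValueError("unexpected state or postcode")

    # The first street-type word that follows at least one name word
    # is taken to end the street name; the rest is the suburb.
    for i, token in enumerate(tokens):
        if i > 0 and token in STREET_TYPES:
            return {
                "street_number": number,
                "street_name": " ".join(tokens[:i]),
                "street_type": token,
                "suburb": " ".join(tokens[i + 1:]),
                "state": state,
                "postcode": postcode,
            }
    raise ValueError("no street type found")


print(parse_simple("64 YORK STREET SYDNEY NSW 2000"))
```

This handles Example 1, but only because the address happens to follow the exact layout the code assumes. Example 2 shows how quickly that assumption breaks.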
Example 2
Address = 6/64 THE BOULEVARDE STRATHFIELD NSW 2135
- 6 = Unit number
- 64 = Street number
- THE = Street name
- BOULEVARDE = Street type
- STRATHFIELD = Suburb
- NSW = State
- 2135 = Postcode
Wait, the street name is “THE”?
It should be “THE BOULEVARDE”!
Boulevarde is a street type as well, but not in this instance! We need a rule for that!
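One possible shape for such a rule is sketched below. The helper names `split_unit` and `split_street` and the short `STREET_TYPES` set are hypothetical, and the rule itself is only a heuristic.

```python
# Small illustrative set of street types (an assumption, far from complete).
STREET_TYPES = {"STREET", "ROAD", "AVENUE", "BOULEVARDE"}


def split_unit(number_token: str):
    """Split '6/64' into unit '6' and street number '64'."""
    unit, _, number = number_token.rpartition("/")
    return (unit or None), number


def split_street(tokens: list):
    """Return (street_name, street_type, remaining_tokens)."""
    # Rule: "THE" followed by a street-type word is a complete street
    # name, so the type word is not split off.
    if len(tokens) >= 2 and tokens[0] == "THE" and tokens[1] in STREET_TYPES:
        return f"THE {tokens[1]}", None, tokens[2:]

    # Otherwise, the first street-type word after at least one name word
    # ends the street name.
    for i, token in enumerate(tokens):
        if i > 0 and token in STREET_TYPES:
            return " ".join(tokens[:i]), token, tokens[i + 1:]
    raise ValueError("no street type found")


tokens = "6/64 THE BOULEVARDE STRATHFIELD".split()
unit, number = split_unit(tokens[0])
street_name, street_type, rest = split_street(tokens[1:])
print(unit, number, street_name, street_type, rest)
```

The special case fixes this address, but each newly discovered exception tends to need another rule of its own, which is how hand-written parsers grow large.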
Example 3
Address = WTC BLDG A / TWR 4 MATTHEW FL LEVEL 1 18-38A SIDDELEY ST, DOCKLANDS VIC 3008
This address is significantly more difficult to parse than the previous examples. However, it still includes many prefixes that can assist with parsing.
It is not uncommon for many of these prefixes to be removed, so that the address looks more like this:
Address = WTC A / TWR 4 MATTHEW 1 18-38A SIDDELEY ST, DOCKLANDS VIC 3008
Without the BLDG, FL and LEVEL prefixes, we now have additional complexity to deal with.
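The effect of the missing prefixes can be seen with a very small prefix-driven extractor. This is a sketch only: the function name `labelled_parts` is hypothetical, and it looks for just three of the prefixes (BLDG, TWR and LEVEL) as an assumption to keep the example short.

```python
import re

# Hypothetical prefix list; real data would need many more entries.
PREFIX_PATTERN = re.compile(r"\b(BLDG|TWR|LEVEL)\s+(\w+)")


def labelled_parts(address: str) -> dict:
    """Collect the value that follows each known prefix."""
    return dict(PREFIX_PATTERN.findall(address.upper()))


with_prefixes = (
    "WTC BLDG A / TWR 4 MATTHEW FL LEVEL 1 18-38A SIDDELEY ST, "
    "DOCKLANDS VIC 3008"
)
without_prefixes = "WTC A / TWR 4 MATTHEW 1 18-38A SIDDELEY ST, DOCKLANDS VIC 3008"

print(labelled_parts(with_prefixes))
print(labelled_parts(without_prefixes))
```

With the prefixes present, the building, tower and level are all labelled. Once they are dropped, only the tower is found, and the bare "A" and "1" have to be interpreted from position and context alone.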
Challenges with Address Parsing
- Variability in formats
Addresses can be written in numerous formats.
For instance, “123 Maple St. Apt 4B” and “Apt 4B, 123 Maple Street” represent the same location but are formatted differently.
- International differences
Different countries have different address structures. What’s common and straightforward in one country might be unusual in another. For instance, some countries might include districts or regions in their addresses, while others don’t.
- Ambiguous elements
Some parts of an address can be confused for others.
For instance, “St.” could be short for “Street” or “Saint.”
Without context, determining the correct interpretation can be tough.
- Complex building details
Addresses can have complex unit numbers, building names, floor numbers, and so forth.
Parsing these details correctly, especially when they’re in non-standard formats, can be difficult.
- Misspellings and typos
People often make mistakes when entering addresses. A parser needs to be robust enough to handle and possibly correct common misspellings or recognise when an address might be invalid.
- Multiple languages and scripts
In multilingual countries or regions, addresses might be written in different languages or scripts. Parsing these requires the program to be aware of multiple linguistic structures.
- Historical changes and inconsistencies
Cities change, streets get renamed, postal codes get updated. An address parser needs to be updated regularly to account for these changes, or it should be robust enough to recognise and possibly map outdated addresses to their current counterparts.
- Abbreviations and synonyms
There are multiple ways to refer to the same thing in addresses. For example, “Avenue” might be written as “Ave,” “Av,” or “Avnue.” A parser must recognise all these variations as referring to the same concept.
- Lack of standardisation
Unlike some data types where a strict format can be enforced, addresses are often entered by users who have no idea about the backend system’s preferred format.
- Embedded information
Sometimes, addresses can contain extra information that’s not strictly part of the address but is crucial for delivery, like instructions or landmarks.
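Two of these challenges, abbreviations and misspellings, can be partly addressed with a lookup table backed by fuzzy matching. The sketch below uses Python's standard `difflib` module. The function name `normalise_street_type`, the short tables and the 0.8 similarity cutoff are all assumptions chosen for illustration.

```python
import difflib

# Illustrative abbreviation table (an assumption, far from complete).
ABBREVIATIONS = {"AVE": "AVENUE", "AV": "AVENUE", "ST": "STREET", "RD": "ROAD"}
CANONICAL_TYPES = sorted(set(ABBREVIATIONS.values()))


def normalise_street_type(token: str):
    """Map a street-type token to its canonical form, or None if unknown."""
    cleaned = token.upper().strip(" .,")
    if cleaned in CANONICAL_TYPES:
        return cleaned
    if cleaned in ABBREVIATIONS:
        return ABBREVIATIONS[cleaned]
    # Fall back to fuzzy matching to catch typos such as "Avnue".
    close = difflib.get_close_matches(cleaned, CANONICAL_TYPES, n=1, cutoff=0.8)
    return close[0] if close else None


for variant in ["Avenue", "Ave,", "Av", "Avnue", "Sydney"]:
    print(variant, "->", normalise_street_type(variant))
```

Note that this table maps "St." to "STREET" unconditionally, so it does nothing for the "Street" versus "Saint" ambiguity; resolving that requires looking at where the token sits in the address.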
Is Accurate Address Parsing Possible?
Most commercial address parsers achieve parsing accuracy of 97 to 98% or higher.
They achieve this through constant development, testing and refinement of their software over many years.
Is it possible to build your own address parsing solution and achieve similar results?
Maybe.
New capabilities and greater accessibility of machine learning algorithms mean self-developed address parsing solutions may be able to produce results that are acceptable for your use case. It is worth noting, however, that such a solution won’t be easy to develop and there will be inaccuracies. You should carefully weigh up the effort of developing an address parsing solution against buying one off the shelf.
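One way to inform that decision is to measure a candidate parser against addresses that have already been labelled by hand. The sketch below assumes a parser is any function that takes a string and returns a dictionary; `parsing_accuracy` and `toy_parse` are hypothetical names, and the toy parser exists only to exercise the measurement.

```python
def parsing_accuracy(parse, labelled_examples) -> float:
    """Fraction of addresses whose parse exactly matches the expected result."""
    if not labelled_examples:
        raise ValueError("need at least one labelled example")
    correct = 0
    for raw, expected in labelled_examples:
        try:
            if parse(raw) == expected:
                correct += 1
        except ValueError:
            pass  # a parser failure counts as an incorrect parse
    return correct / len(labelled_examples)


def toy_parse(raw: str) -> dict:
    """Toy parser: assumes the last two tokens are state and postcode."""
    *_, state, postcode = raw.split()
    return {"state": state, "postcode": postcode}


labelled = [
    ("64 YORK STREET SYDNEY NSW 2000", {"state": "NSW", "postcode": "2000"}),
    ("6/64 THE BOULEVARDE STRATHFIELD NSW 2135", {"state": "NSW", "postcode": "2135"}),
    # Mixed case input: the toy parser does not normalise it, so this one fails.
    ("18-38A Siddeley St, Docklands Vic 3008", {"state": "VIC", "postcode": "3008"}),
]

print(f"accuracy: {parsing_accuracy(toy_parse, labelled):.0%}")
```

Exact-match scoring is strict, since one wrong component fails the whole address. A per-component score would be a reasonable alternative if partial parses are still useful to you.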
Address Parsing Software Providers
Australia:
- Geoscape Australia: Provides geospatial data solutions, including address parsing and geocoding for Australian addresses.
- Precisely: They offer global solutions, including for Australia, in the realm of data quality and address management.
- Equifax Australia: Offer address cleansing and geocoding solutions.
USA:
- SmartyStreets: Offers address validation, geocoding, and parsing primarily for the U.S. but also internationally.
- Melissa Data: Provides data quality solutions, including address validation, correction, and parsing for the USA and other countries.
- Pitney Bowes: Global solutions, including for the U.S., in data quality and address management.
Canada:
- Canada Post: Their AddressComplete solution provides parsing, validation, and autocomplete for Canadian addresses.
- DMTI Spatial: Offers Canadian geospatial data solutions, which include address parsing and validation.
UK:
- PCA Predict (Loqate): Provides address lookup, validation, and parsing solutions predominantly for the UK but also globally.
- Allies Computing: Their PostCoder web service offers address lookup and validation for the UK and other countries.
- Royal Mail: They have solutions for address validation and parsing for UK addresses.
It’s worth noting that many of these providers offer services for multiple countries, not just the ones listed under their respective headers. For example, a company that provides services in the USA might also cater to UK or Australian addresses.
When considering an address parsing provider, it’s essential to check if they cover the specific regions and countries you need, and if they offer the depth of functionality (e.g., address validation, geocoding, etc.) that your project requires.