Extracting numbers from a string 2: Consecutive numbers at end 3

This is the second in a series of discussions on the techniques available for extracting numbers from an alphanumeric string. In the first instalment in this series (which can be found here) I looked at extracting consecutive numbers which appear at the start of the string, e.g. 123ABC456.

In this post I will concentrate on techniques for extracting numbers from a string where:

  • The numbers are consecutive
  • The consecutive string of numbers is found at the very end of the string
  • The desired result is to have those consecutive numbers returned to a single cell

As previously, for each of the given solutions, we need to test its soundness in two separate cases: firstly, where there are no numbers elsewhere in the string, e.g. ABC456 and secondly, where there are some numbers elsewhere in the string, either at the start, e.g. 123ABC456, or in the middle, e.g. ABC123DEF456.

And just to reiterate my statement from the first post in this series, although some of the solutions which I will be presenting work equally well whether there are additional numbers in the string or not, and some do not, that is not to say that we should prefer the former in all cases. Instead, where we have sufficient information available, we should make the sensible decision to choose that set-up which is the simplest and least resource-intensive.

Although the task of extracting consecutive numbers from the end of a string is sufficiently different to that of extracting consecutive numbers from the start of a string as to warrant a separate discussion, the techniques involved are nevertheless not that dissimilar in both cases, and some of the constructions I will be presenting will bear a strong resemblance to those given in the previous post in this series.

Nevertheless, not all the solutions available here are equally appropriate for cases where the desired extraction is at the start of the string. Indeed, the very first technique which I am about to outline is particular to the problem of extracting numbers from the end of a string and does not have a reasonable equivalent in the case where the desired extraction is instead at the start of the string.

So let’s look at the first of our candidates:

2.1 MIN with FIND

Array formula? NO
Works equally if the string also contains numbers at the start? NO
Any strings which cause it to fail? NO

The set-up in question is:

=0+MID(A1,MIN(FIND({0,1,2,3,4,5,6,7,8,9},A1&1/17)),LEN(A1))

Let’s first see how this works if the string in A1 contains no additional numbers other than those at the end, e.g. ABC456, so that the above is:

=0+MID("ABC456",MIN(FIND({0,1,2,3,4,5,6,7,8,9},"ABC456"&1/17)),LEN("ABC456"))

The use of the fraction 1/17 as the value to catenate to the end of the string in A1 is my own personal choice (and does in fact often vary!). Readers can see here for a discussion on this peculiar-looking choice if they wish. The “standard” method is to concatenate A1 with the string “0123456789”.

Both amount to the same thing in the end, though to reiterate what I stated in that post, the concatenation of A1 with a string which contains all digits from 0-9 is necessary so as to prevent MIN from returning an error. Since, if A1 did not contain all of the digits from 0-9, then at least one of the returns from the FIND function would be an error. And since MIN is here not capable of ignoring error values the whole construction would therefore also result in an error.

By first catenating a suitable value (e.g. 1/17, 3^45, “0123456789”) to the end of the string in A1, we guarantee that it will contain at least one of each of the digits from 0-9 (so that all 10 returns from FIND will be non-error values), whilst ensuring, by virtue of having placed this string at the end of A1, that we do not affect the required output for the MIN function.

Just to clarify, resolving this fraction 1/17 in the above, we have:

=0+MID("ABC456",MIN(FIND({0,1,2,3,4,5,6,7,8,9},"ABC456"&0.0588235294117647)),LEN("ABC456"))

and so the point of using a “pandigital” value as 1/17 is revealed, since its 15 decimal places contain at least one occurrence of all of the integers from 0-9 inclusive. Resolving the concatenation gives:

=0+MID("ABC456",MIN(FIND({0,1,2,3,4,5,6,7,8,9},"ABC4560.0588235294117647")),LEN("ABC456"))

It is probably not that widely known that, in certain constructions, MIN possesses the ability to coerce an array of returns, and without the need for CSE at that. Here, that is precisely the case, and so FIND will operate over all 10 values passed to it, not just the first as would be the case if there was no appropriate coercion being applied to the function. We then have:

=0+MID("ABC456",MIN({7,19,13,14,4,5,6,21,11,17}),LEN("ABC456"))

which is clearly:

=0+MID("ABC456",4,LEN("ABC456"))

and so we know that we will be beginning the extraction at position 4 of the string “ABC456“, which is clearly as required. All that remains is to decide upon a suitable value for MID’s third parameter, num_chars.

MID (and RIGHT and LEFT, to name just two more) have the useful property, when passed a num_chars parameter which is greater than the length of the actual string in question, of not erroring. Instead, in cases where the parameters would theoretically result in a return either wholly or partially beyond the end of the string, these functions return only those characters up to the end of the string.

Here then, it is not so important that we determine the precise value of this parameter (which would be 3 in the case of “ABC456”), and we can take advantage of this lenient property of MID to pass any value we like, provided of course that is greater than or equal to 3.

You will often see static values used in this type of construction, 99 for some reason being a very popular choice. However, although such choices are near-certain to be sufficient, we cannot guarantee that they will work in all cases.

Hence my preference for a dynamic construction to determine this value – LEN(A1) – the point being that no individual substring can be greater in length than the entire string itself. Hence, our formula becomes:

=0+MID(A1,4,6)

which is 456, as required.

Now let’s take a look at what this method returns if the string in A1 contains additional numbers, e.g. 123ABC456. Readers can probably already see where this will fail, since this time our MIN construction will look like:

MIN(FIND({0,1,2,3,4,5,6,7,8,9},"123ABC4560.0588235294117647"))

which is:

MIN({10,1,2,3,7,8,9,24,14,20})

i.e. 1, and so, since we are passing this value to MID for its start_num parameter, our result will come from the very start of the string, which is clearly not at all what we want.

It is evident, then, that any additional numbers in the string will be preferred by the MIN function in this set-up, and so it is not to be recommended in those cases. Of course, we could first create a substring of the string in A1, beginning with “C” say (i.e. “C456”), by detecting the last non-numeric in the string, and then passing this to our MID/FIND construction. However, the additional complexity in this set-up would make this a very impractical choice indeed, particularly since we have some adequate (and reasonably short) alternatives, as below.

2.2 LOOKUP with RIGHT (dubious)

Array formula? NO
Works equally if the string also contains numbers at the start? YES
Any strings which cause it to fail? YES

A popular choice if the various Excel help forums and sites around the internet are anything to go by, though (seemingly unbeknownst to those referees) unfortunately not at all rigorous.

The set-up in question is:

=-LOOKUP(0,-RIGHT(A1,ROW(INDIRECT("1:"&LEN(A1)))))

Readers are invited to refer to the previous post in this series for a detailed explanation as to the functioning of this construction, which in any cases differs from the version there only by virtue of using RIGHT, not LEFT.

As such, I will not re-iterate the reasons for its being called “dubious” here. Suffice to say that readers may wish to investigate the returns of this construction when A1 contains either of the strings “JAN18” or “123E4“.

2.3 RIGHT, MATCH and ISNUMBER

Array formula? YES
Works equally if the string also contains numbers at the start? YES
Any strings which cause it to fail? NO

Again, as in the case with extracting numbers from the start of a string, we are much better off examining characters on an individual basis rather than examining sections in bulk, which, as we just saw with the LOOKUP example, can lead to undesired results.

Querying Excel as to which chain of characters comprises the longest string of numericals may seem like a reasonable approach, but, as we’ve seen, the occasional alphanumerical chain can also be interpreted as such, and so our attempt fails.

If we restrict ourselves to looking at each individual character in a string, however, we can be 100% certain that our test for numericalness is quite sound: there are no single non-numerical characters (that I know of) which Excel can mistakenly interpret as numerical.

And this is where this combination of RIGHT, MATCH and ISNUMBER has the upper hand. The precise construction is:

=0+RIGHT(A1,MATCH(FALSE,ISNUMBER(0+MID(A1,1+LEN(A1)-ROW(INDIRECT("1:"&LEN(A1))),1)),0)-1)

As before, let’s first test it on a string containing no other numbers than those to be extracted from the end, e.g. ABC456. Since the length of this string is 6, we can substitute both A1 and LEN(A1) into the above to give:

=0+RIGHT("ABC456",MATCH(FALSE,ISNUMBER(0+MID("ABC456",1+6-ROW(INDIRECT("1:"&6)),1)),0)-1)

Resolving the ROW construction gives:

=0+RIGHT("ABC456",MATCH(FALSE,ISNUMBER(0+MID("ABC456",1+6-{1;2;3;4;5;6},1)),0)-1)

which is:

=0+RIGHT("ABC456",MATCH(FALSE,ISNUMBER(0+MID("ABC456",{6;5;4;3;2;1},1)),0)-1)

and so we see from the array of values which we will be passing to MID for its start_num parameter – {6;5;4;3;2;1} – that we will be examining each character in “ABC456” in turn from last to first.

We then test each of these characters as to whether it is a number or not (first coercing with a suitable mathematical operation), then stop at the first occurrence where this is not the case. Evidently this must coincide with the character immediately prior (in left-to-right terms) to that at which we wish to begin our desired extraction.

To continue the formula resolution, the above becomes:

=0+RIGHT("ABC456",MATCH(FALSE,ISNUMBER(0+{"6";"5";"4";"C";"B";"A"}),0)-1)

The “suitable mathematical operation” I chose here was to add zero (others, e.g. multiplying by 1, double-negation, etc. are also perfectly good), which coerces the text strings representing numbers into actual numbers. Naturally this operation is invalid on anything which cannot be regarded as a numerical, and so we have:

=0+RIGHT("ABC456",MATCH(FALSE,ISNUMBER({6;5;4;#VALUE!;#VALUE!;#VALUE!}),0)-1)

Resolving the ISNUMBER function, which has the nice property that it ignores error values within the array passed to it, gives:

=0+RIGHT("ABC456",MATCH(FALSE,{TRUE;TRUE;TRUE;FALSE;FALSE;FALSE},0)-1)

and we can easily see that the MATCH portion resolves to:

=0+RIGHT("ABC456",4-1)

This value of 4 represents the position of the first non-numerical character in the string “ABC456” as we approach from the right. We then simply subtract 1 from this value to obtain the desired parameter to pass to RIGHT, viz:

=0+RIGHT("ABC456",3)

i.e. 456, as desired.

Now let’s see how this set-up handles strings which contain additional numbers to that at the end, e.g. 123ABC456

Well, you can probably tell straight away that the presence of these additional numbers will not affect the output. Since our approach involved attacking the string from the right, we should not be duly concerned about any additional numbers to the left of the desired extraction. And indeed you can easily verify that:

=0+RIGHT("123ABC456",MATCH(FALSE,ISNUMBER(0+MID("123ABC456",1+LEN("123ABC456")-ROW(INDIRECT("1:"&LEN("123ABC456"))),1)),0)-1)

boils all the way down to:

=0+RIGHT("123ABC456",MATCH(FALSE,ISNUMBER(0+{"6";"5";"4";"C";"B";"A";"3";"2";"1"}),0)-1)

and, again, since we are approaching from the right, the result of our MATCH will be no different from before: “C” is still the first non-numerical we come across, the additional numbers having done nothing to change this.

Readers may wish to continue the resolution of the above to the very end to be sure, though rest assured that the result will be the same.

A good construction then, and one whose extra (relative to the previous two set-ups) function calls and need for array-entry are more than compensated for by its rigour and ability to cope with additional numbers in the string.

That just leaves one set-up, then, and one which is practically identical to the above, the only amendments to which being designed for the sole purpose of shaving off a few characters. Although that may not seem like much, given the choice between two solutions which are otherwise identical, it nevertheless makes sense to prefer the shorter of the two.

2.4 RIGHT, MATCH and ISERR

Array formula? YES
Works equally if the string also contains numbers at the start? YES
Any strings which cause it to fail? NO

The set-up here is:

=0+RIGHT(A1,MATCH(1,0+ISERR(0+MID(A1,1+LEN(A1)-ROW(INDIRECT("1:"&LEN(A1))),1)),0)-1)

Since not much has changed from the previous formula, and again testing on the string “123ABC456“, we know that this will resolve to:

=0+RIGHT("123ABC456",MATCH(1,0+ISERR(0+{"6";"5";"4";"C";"B";"A";"3";"2";"1"}),0)-1)

This time, instead of finding the first occurrence (from the right) which does not meet our test for numericalness, we instead take advantage of the fact that any character within the string which is not a number will produce an error when coerced by a suitable mathematical operation, here addition of zero.

Hence, the above resolves to:

=0+RIGHT("123ABC456",MATCH(1,0+{FALSE;FALSE;FALSE;TRUE;TRUE;TRUE;FALSE;FALSE;FALSE},0)-1)

The only other small difference between this and the previous formula is that we here first coerce the array of Booleans resulting from MATCH to numericals so that we can use 1 for the lookup_value in place of the ever-so-slightly longer TRUE. The above then becomes:

=0+RIGHT("123ABC456",MATCH(1,{0;0;0;1;1;1;0;0;0},0)-1)

i.e.:

=0+RIGHT("123ABC456",4-1)

and we have precisely the same intermediate result as we obtained previously.

To conclude then, just as in the first instalment in this series, in which we looked at extracting consecutive numbers from the start of a string, here the solutions proposed fall largely into the same categories.

We again have a succinct, non-array solution using LOOKUP available which works in all but a tiny minority of cases and which would otherwise be ideal as a candidate for best, general solution.

We also, just as before, have a neat, concise set-up which works perfectly when there are no additional numbers in the string (the MIN/FIND combination here; the LEFT/COUNT combination in part 1), but which fails otherwise.

And, again, we must give thanks to MATCH for lending us its functionality in providing the most rigorous – if slightly longer – solutions. The functionality to which I allude – and which is pivotal to the success of this set-up with MATCH – is the ability to parse a string of Boolean TRUE/FALSE responses to the individual tests as to whether each character within the string is a number or not.

And this is the lesson to be learnt, then. It seems that, try as we might to come up with alternative constructions which attempt to bypass the longer route which involves inspecting each individual character within the string, we cannot avoid hitting some sort of wall.

These alternatives either lack rigour, in that they do not work for all possible strings (as in the set-up with LOOKUP), or else fail if there are additional numbers present within the string (as in the MIN/FIND set-up).

Of course, as I said at the beginning of this post, that does not in any way mean that we should not use either of these two constructions. Indeed, the MIN/FIND combination is always my preference if I am certain that the strings on which I am performing the extraction contain no additional numbers.

I cannot quite say the same for the LOOKUP solution, however, since, as I said in the previous instalment in this series, I personally don’t like the idea of having to go through a set of data in order to first verify whether any of the strings on which I am going to be performing my extractions do not contain any of dozens of different alphanumeric combinations.

So, overall, MATCH + ISERR it is then, a second win for this combination in as many stages. But how will it fare when we come to considering other, more complex string extractions? Will this set-up still offer us the best means by which to obtain our result, or will we have to look to other functions for help?

We’ll see! Part 3 to follow shortly!

3 comments

  1. Pingback: Extracting numbers from a string 3: All numbers to individual cells | EXCELXOR

  2. Pingback: Extract number from the right of the string

  3. Pingback: Extracting numbers from a string 4: All numbers to a single cell | EXCELXOR

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s