Extracting numbers from a string 1: Consecutive numbers at start 8

This is the first in a series of discussions on the techniques available for extracting numbers from an alphanumeric string. Since we often have many different solutions at our disposable for such tasks, I will attempt to present what I feel are the principal candidates and, for each of these set-ups, discuss the merits and potential drawbacks inherent in each.

In the next instalment in this series I shall look at extracting consecutive numbers which appear at the end of the string, e.g. ABC123. In later posts I will deal with cases in which the desired numbers to be extracted are interspersed within the string in groups of one or more, e.g. ABC12DE345-FG6H789, in which case we may be interested in extracting either the number 123456789 into a single cell or each of 12, 345, 6 and 789 into four separate cells.

I shall also consider in future posts cases in which there may be several numbers within a string, though from which we wish to extract perhaps only one (or more) of these numbers, and for which our choice of extraction is based upon one or more criteria. For example, given a string of the form X12-X34-X56-X78-X90 we may wish to develop a technique which extracts the number immediately preceding the fourth occurrence of a hyphen within that string.

But for this first post I will concentrate on extracting numbers from a string where:

  • The numbers are consecutive
  • The consecutive string of numbers is found at the very start of the string
  • The desired result is to have those consecutive numbers returned to a single cell

For each of the given solutions, we need to test its soundness in two separate cases: firstly, where there are no numbers elsewhere in the string, e.g. 123ABC and secondly, where there are some numbers elsewhere in the string, either at the end, e.g. 123ABC456, or in the middle, e.g. 123ABC456DEF.

Some of the solutions which I will be presenting work equally well whether there are additional numbers in the string or not, and some do not, and in the analysis for each construction I shall make it clear which is the case.

However, it is not necessarily true to say that, given a choice, we should prefer the more generic solution in all cases. Indeed, such solutions are likely (though not certain) to involve additional complexity, and therefore if our dataset consists of strings for which there is only a single batch of numbers (either at the end or the beginning) then we should obey the logic of Occam’s razor and opt instead for a solution which is tailored for just those cases.

We also need to assert whether the solution works for all possible variations of the string type in question. This may sound self-evident, but, as you’ll see, perhaps the most well-known of all solutions to this problem does not actually meet that criterion, contrary to popular belief.

Readers should also that, although I sometimes employ constructions which use a single INDEX function so as to circumvent having to commit the formula as an array formula (see here for details), I will not be doing so here: all array formulas will be left as just that. Having said that, if a “natural” (i.e. one not obtained via this method of inserting an extra INDEX function) non-array alternative is available, then it is likely, all else being equal, that I shall tend to favour that over alternative array set-ups when it comes to giving my final verdict as to which is the best solution.

Perhaps surprisingly, the problem of extracting consecutive numbers from the start of a string is in general a little more complex than that in which the numbers are at the end of the string, mainly by virtue of the fact that we do not have a suitable alternative to the MIN/FIND set-up (which will be discussed in detail in the follow-up post to this one, though readers may see here in the meantime).

Even more surprising, however, is that if you were to search around the internet for solutions to this particular problem then there is a strong chance that you would come away with a firm belief that the construction to employ in these cases is the one I am about to present, though which, as you’ll see, is fundamentally flawed:

1.1 LOOKUP with LEFT (dubious)

Array formula? NO
Works equally if the string also contains numbers at the end? YES
Any strings which cause it to fail? YES

I should first clarify that I don’t wish to contradict my earlier statement about using solutions which are tailored for particular circumstances. By “flawed” and “dubious” I do not mean that this solution should not be employed at all; by all means use it. Though what must be borne in mind (and this is what the vast majority of the sources which recommend this set-up either don’t point out or – worse – don’t realise, and the reason for my calling it dubious) is that this construction is not guaranteed to work for all strings.

The set-up in question is:

=-LOOKUP(1,-LEFT(A1,ROW(INDIRECT("1:"&LEN(A1)))))

First let’s look at how it works, and then see where it may fail. And the string we’ll test this on is as given above, i.e. 123ABC

The part:

ROW(INDIRECT("1:"&LEN(A1)))

should hopefully be familiar to most readers, and simply generates a single-column vector consisting of integers from 1 up to the number of characters within the string in A1. Hence, in this case it resolves to:

{1;2;3;4;5;6}

Readers should note that I will be using this construction with INDIRECT from now on in my analysis. Alternatives you will see which attempt to either reduce the number of function calls, e.g.:

ROW($1:$99)

or to lessen the volatility of the construction, e.g.:

ROW(INDEX($A:$A,1):INDEX($A:$A,LEN(A1)))

will not be considered here. The former lacks rigour, in the sense that row insertions within the range will cause issues (the fact that 99 is often seen here, a value which is not guaranteed to be sufficiently large, is a further detriment to this set-up). The second is a nice construction, and one which is immune to row insertions, though it is perhaps a touch convoluted (not to mention the fact that it is not entirely non-volatile, being what is referred to as “volatile only at workbook open”). Besides, more to the point is that I have a suspicion that such a method is a touch unnecessary: volatility is overrated, in my opinion.

Hence, our formula resolves to:

=-LOOKUP(1,-LEFT(A1,{1;2;3;4;5;6}))

and, since LOOKUP is lending the necessary array coercion to the construction, LEFT here will return not a single value but an array of six values, each corresponding to applying LEFT to the string in A1 with a num_chars parameter of 1, 2, 3, 4, 5 and 6, i.e.:

=-LOOKUP(1,-{"1";"12";"123";"123A";"123AB";"123ABC"})

Multiplying this array by -1 (which is effectively what preceding the array by a single minus sign does) results in:

=-LOOKUP(1,{-1;-12;-123;#VALUE!;#VALUE!;#VALUE!})

For readers not sure how LOOKUP operates, it is sufficient here to point out that, if the lookup_value is not found within the lookup_vector, and providing that there are no values within the lookup_vector greater than the lookup_value, then the function returns the last (in this case numerical) entry from the lookup_vector. The function also has the useful property that it ignores any error values within the lookup_vector.

This then was the reason for first negating the values in the lookup_vector, which, by creating an array consisting of either negative numbers, zeroes (it’s possible our desired extraction could begin with 0, e.g. 0123ABC) or error values, ensures that a lookup_value of 1 will always be a sufficient and legitimate choice.

Here, for example, since 1 is not found within the lookup_vector (and cannot be, if you think about it, unless we had some initial string such as e.g. –123ABC, which I will not be considering in this post), then the formula returns the last numerical value in the array, i.e. -123.

Of course, this is by no means the only way to handle this construction; we could equally leave the lookup_vector un-negated. Here, though, we would have to spend a little thought to ensure that our choice of lookup_value is sufficiently large (1 will always be such in the construction I chose).

Not that this is difficult, and indeed you will more than likely see this un-negated construction in practice, using a lookup_value of either the so-called “Big Num” (i.e. 9.99999999999999E+307, the largest allowed positive number in Excel) or, as some prefer, a mere “very large” value such as 10^10 (which has the benefit of being not so unwieldy-looking as “Big Num”).

By first negating values in the lookup_vector, and so allowing a choice of unity for the lookup_value, I like to think that there is some efficiency, even simplicity, in the set-up I chose perhaps lacking in the alternative constructions given above. Of course, this is just a personal preference, and readers should obviously choose whatever values they wish, providing of course that those choices are rigorous.

Hence, since the LOOKUP above returns -123, we simply re-negate this value to obtain 123, as desired.

So why is this set-up “dubious”? Well, because it just so happens that there are certain strings which, when submitted to this construction, will return results other than those desired. For example, taking the entry in A1 to be 12JUN, then:

=-LOOKUP(0,-LEFT(A1,ROW(INDIRECT("1:"&LEN(A1)))))

becomes:

=-LOOKUP(0,-LEFT(A1,{1;2;3;4;5}))

which is:

=-LOOKUP(0,-{"1";"12";"12J";"12JU";"12JUN"})

and here you may already be able to guess where this is going, since this now becomes:

=-LOOKUP(0,{-1;-12;#VALUE!;#VALUE!;-41802})

where 41802 is the result of Excel (unhelpfully, in this case) interpreting the string “12JUN”, after being coerced into a numerical, as the date 12/06/2014 (by virtue of 2014 being the year at the time of writing, of course).

Naturally this will be the return of the LOOKUP function, and so the above will resolve to, not 12 as desired, but 41802.

And this of course is not the only string which will generate such a return. In fact, any combination of alphanumeric characters which may be interpreted by Excel as corresponding to a given text format will be treated as such. So, for example, 30SEP, 01FEB, etc. will also generate incorrect results using this set-up. And naturally this issue is not exclusive to Excel versions with English as their regional language: any appropriate mixing of letters and numbers which also happens to qualify as a valid Excel date string within the version being used will suffer the same fate.

And it’s not only dates which are the culprits. Indeed, if we consider cases where the string in question contains numbers at the end as well as at the start, an entry such as 123E3, passed to this set-up, will actually result in a value of, not 123, but 123,000. Try typing 123E3 into a cell formatted as General and you’ll see that Excel happily assumes you must have meant this as shorthand scientific notation for 123 x 103.

Consequently, although it is true that this set-up with LOOKUP has several advantages over others (as we’ll see), such as its relative brevity, simplicity and the fact that it does not require array-entry, it is only to be recommended for use in cases where it can be stated with absolute certainty that none of the strings to which we will be applying the construction could possibly be interpreted as something other than that which we intend, i.e. an alphanumeric string.

Personally, I don’t like the idea of having to go through a set of data, especially if it’s a particularly large set, e.g. thousands of cell entries, in order to first verify whether any of the strings on which I am going to be performing my extractions do not contain any of dozens of different alphanumeric combinations. I leave it to the reader to decide whether they also feel the same.

1.2 LEFT with COUNT

Array formula? YES
Works equally if the string also contains numbers at the end? NO
Any strings which cause it to fail? NO

The (array formula) set-up here is as follows:

=0+LEFT(A1,COUNT(0+MID(A1,ROW(INDIRECT("1:"&LEN(A1))),1)))

Let’s see how it works.

Well, we saw that the construction:

ROW(INDIRECT("1:"&LEN(A1)))

resolves to:

{1;2;3;4;5;6}

so we will have:

=0+LEFT(A1,COUNT(0+MID(A1,{1;2;3;4;5;6},1)))

This time then we are passing an array of six values to MID for its start_num parameter, which will result in:

=0+LEFT(A1,COUNT(0+{"1";"2";"3";"A";"B";"C"}))

Adding the zero to that array gives:

=0+LEFT(A1,COUNT({1;2;3;#VALUE!;#VALUE!;#VALUE!}))

COUNT is a function which usefully ignores error values, and so this is simply:

=0+LEFT(A1,3)

i.e. 123, as required.

Now let’s take a look at how this construction copes when there are additional numbers in the string. For example, with the string “123ABC45″ in A1, we will then have:

=0+LEFT(A1,COUNT(0+MID(A1,{1;2;3;4;5;6;7;8},1)))

which is:

=0+LEFT(A1,COUNT(0+{"1";"2";"3";"A";"B";"C";"4";"5"}))

i.e.:

=0+LEFT(A1,COUNT({1;2;3;#VALUE!;#VALUE!;#VALUE!;4;5}))

and since this resolves to:

=0+LEFT(A1,5)

which is:

=0+"123AB"

we end up with a #VALUE! error. The issue being the fact that additional numbers at the end of the string are equally passed to COUNT, and so the resultant parameter passed to LEFT will always be larger than is necessary.

1.3 LEFT, MATCH and ISNUMBER

Array formula? YES
Works equally if the string also contains numbers at the end? YES
Any strings which cause it to fail? NO

Although this construction contains more function calls than the previous two, the introduction of a MATCH clause necessarily lends it more rigour, and so fully justifies the extra computation in my opinion. The set-up is:

=0+LEFT(A1,MATCH(FALSE,ISNUMBER(0+MID(A1,ROW(INDIRECT("1:"&LEN(A1))),1)),0)-1)

We have already derived the MID/ROW construction, so inserting this gives is:

=0+LEFT(A1,MATCH(FALSE,ISNUMBER({1;2;3;#VALUE!;#VALUE!;#VALUE!}),0)-1)

ISNUMBER has the convenient property that it does not itself error when passed an array containing error values, but instead processes those error values as Boolean FALSEs, so this becomes:

=0+LEFT(A1,MATCH(FALSE,{TRUE;TRUE;TRUE;FALSE;FALSE;FALSE},0)-1)

Resolving the MATCH portion gives:

=0+LEFT(A1,4-1)

and so finally:

=0+LEFT(A1,3)

i.e. 123, as desired.

The nice thing about this construction is that, by virtue of this MATCH/ISNUMBER set-up, we can rest assured that additional numbers within the string will not unduly affect the result. For example, passing the string “123ABC45″ to this set-up produces:

=0+LEFT(A1,MATCH(FALSE,ISNUMBER({1;2;3;#VALUE!;#VALUE!;#VALUE!;4;5}),0)-1)

and readers can easily verify that the additional values of 4 and 5 within the MATCH lookup_array in no way affect the output of that function. Hence, the overall result is still 123, as required.

1.4 LEFT, MATCH and ISERR

Array formula? YES
Works equally if the string also contains numbers at the end? YES
Any strings which cause it to fail? NO

This is practically identical to the previous construction, the only difference (and which might be considered an advantage by some) being the small reduction in the total number of characters being used, achieved by the two feats of replacing ISNUMBER with ISERR and coercing the Booleans resulting from MATCH to numericals so that we can use 1 for the lookup_value in place of the ever-so-slightly longer TRUE.

Although seeking to reduce the number of characters in a formula is not something I would recommend just for the sake of it, and especially not if those attempts at abbreviation compromise the functioning of that formula, given a choice between two otherwise equal set-ups it nevertheless makes sense to prefer the shorter of the two.

The version in question is:

=0+LEFT(A1,MATCH(1,0+ISERR(0+MID(A1,ROW(INDIRECT("1:"&LEN(A1))),1)),0)-1)

which we know from before will become:

=0+LEFT(A1,MATCH(1,0+ISERR({1;2;3;#VALUE!;#VALUE!;#VALUE!;4;5}),0)-1)

i.e.:

=0+LEFT(A1,MATCH(1,0+{FALSE;FALSE;FALSE;TRUE;TRUE;TRUE;FALSE;FALSE},0)-1)

and it can easily be verified that this results in a final output of 123, as required. And just as with the previous version, it is also not difficult to see that additional numbers within the string will have no detrimental effect on this result.

To conclude then, if I had to nominate one of the above as being the preferred choice, I would probably go for this final version. The set-ups with MATCH, despite consisting of extra function calls compared to the alternatives discussed here, have a considerable advantage in the rigour stakes in that their outputs are unaffected by additional numbers elsewhere within the string.

It is perhaps a shame that a set-up such as the LOOKUP version, which is otherwise brief, non-array and not unduly burdened with extra functions, is not guaranteed to work in all cases. Granted, the strings which cause this set-up to fail make up such a miniscule proportion of all possible strings that we might easily pardon this shortcoming. However, as I said previously, unless we are in a position to know with certainty the precise type and content of all the strings which we will be passing to our construction, it must be concluded that use of this LOOKUP set-up simply offers too much objective risk.

As for the COUNT version, I confess that this would be my choice if I was certain that none of the strings I was dealing with contained any additional numbers, since this set-up is likely to be a touch faster than the longer MATCH constructions.

That just leaves a choice between the two MATCH set-ups. Although there really isn’t much to choose, and despite the fact that I am often hesitant to opt for a construction which relies on an error function (ISERR), especially if there is an alternative available, here I am confident that the error function is both appropriate and risk-free. As such, the small reduction in character length just gives it the edge over the ISNUMBER set-up for me.

That said, I am by no means certain that my logic governing this choice is unquestionable. Nor am I fully confident that there are not alternative constructions which I have not even considered, and suggestions from readers to that extent will be more than welcome.

The fascinating thing about string extractions in Excel is that there is very rarely only one way to go about achieving the result. As such, we often have a plethora of constructions at our disposal, some of which may be wildly different in their make-up from the others.

And if we throw in the fact that new techniques seem to crop up continuously in this area of Excel, then we are left with a rich and exciting field for developing worksheet function capabilities and for honing our skills. And that can only be a good thing!

8 comments

  1. Hi

    It’s the first time I’ve seen the trick with lookup value 1 and negating the search vector.

    Instead of ROW(something) I use COLUMN(1:1).

    =LOOKUP(1,-LEFT(A1,COLUMN(1:1)))
    {=--LEFT(A1,MATCH(FALSE,ISNUMBER(-MID(A1,COLUMN(1:1),1)),0)-1)}
  2. @XLarium: Thanks for your comment and welcome to the site!

    This idea of using COLUMN(1:1) is an interesting one. I’ve seen ROW(A:A) (not such a good idea) used before, but not this variation. Obviously these two have the benefit over constructions such as ROW(1:99) (absolute or relative) which will fail if rows are inserted/deleted within the range.

    My only slight concern is the fact that using COLUMN(1:1) to generate a series of integers effectively means creating an array consisting of 16,384 values, which might be a touch overkill if the function in question only needs to operate over a maximum of, say, 10.

    Having said that, it’s certainly better than using ROW(A:A) – an array of 2^20 values! – and it does have the advantage that it avoids a few extra function calls and is also non-volatile. And this is probably more than enough compensation for the the few extra microseconds in calculation time required.

    So thanks! I may well end up using this construction myself in future! Unless there are some drawbacks you know of that I haven’t mentioned, of course?

    Thanks again!

  3. Hi

    Well, I admit it’s a little overkill. But when you’re not sure what the maximum length will be, it is a simple way coping with that not-knowledge. On the other hand formula constructions using ROW(INDIRECT()) are volatile and also cost some extra (mirco)seconds to compute.

    COLUMN(1:1) wasn’t my idea. I suppose I first saw it on excelformeln.de – a popular German website. Every solution is formula-based. No VBA allowed. A counterpart to your blog. ๐Ÿ™‚

  4. Yes, agreed to a certain extent. I guess it would be interesting to see the difference in computation times, though I imagine even passing an array of 16,384 values to e.g. MID will not cost too much in terms of resource.

    Also, although most functions for dealing with strings don’t error when passed parameters greater than the actual string length, we might not always be so fortunate, and so will either have to define that array precisely, e.g. using ROW(INDIRECT… or else use an error function.

    Still, I think you’ve convinced me that, where this construction can be employed, it should be. So thanks again, both to you and to all at excelformeln.de!

    Talking of which, I like the sound of this site! Will have to brush up on my German and come and visit! ๐Ÿ™‚

    Regards

  5. Actually I may have to reconsider! Having thought about it, this choice of COLUMN(1:1), while guaranteed to work in 99.999% of cases, is nevertheless not guaranteed to work on all strings.

    The reason being of course that COLUMN(1:1) produces an array of integers from 1 to 16,384. However, since a string can have a length of up to twice that number of characters (32,768), this set-up is not certain to work.

    Regards

  6. And what a good article it is, if I might say so. Though I have to say that I might disagree with you re Manual Calculation Mode, which personally I use exclusively, not so that I can pretend to hide the truth of any terribly inefficient, volatile workbooks I’ve designed, but more because I’m so in the habit of pressing a button when I wish to calculate something. I never expected a hand-held calculator to give me an answer until I’d pressed the equals sign… ๐Ÿ™‚

    Anyway, fair point. I was being a touch flippant when I made that comment, and I generally try to avoid volatility at all costs, this ROW/INDIRECT construction being one time that I tend make an exception. What are your thoughts on this ROW/INDEX construction? How does “Volatile at Workbook Open Only” rate with you? What precisely does this mean anyway, in terms of workbook performance?

    Regards

  7. I personally use the ROW(INDEX:INDEX) construction, although I agree that it is more mysterious to your average user than ROW(INDIRECT). Having something recalculating only on open and close that is otherwise untouched is better than having something recalculated at every entry.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s