Simultaneous Locating of First and Last Numbers in a String 23

I was initially debating whether to give this post a more pragmatic title, such as “Extracting Phone Numbers from a String”, that being one of the more common practical applications for the techniques outlined here.

However, the extraction of phone numbers (I’m referring here to that type which employs some form of delimiter, e.g. 1-800-12345, and not that which comprises a non-delimited numerical string, e.g. 180012345, there existing already well-documented formula techniques for the extraction of the latter – although of course the set-up given here will work for those as well) is certainly not the only use for this method, and so, in the end, I chose to go with a less restrictive, more theoretical title.

Indeed, there may be many situations in which we are faced with the problem of determining the positions of both the first and last numbers in a string, most likely with a view to extracting the substring enclosed within those two boundaries.

However (and this is the reason I mentioned delimited phone numbers as an example), unless the desired extraction is made up entirely of numerics we will be unable to apply the usual constructions at our disposal. And attempting to salvage such an approach by first performing one or more substitutions to remove any delimiters from the string would, in most cases (and certainly if the positions of those delimiters are variable), only lead to further complications.

I should first point out that this post deals exclusively with the case in which the desired extraction is the only substring which contains numerics within the string. Of course, this is a prerequisite if we wish to employ a solution, as here, which locates the first and last numerics within that string.

The solution which I will be demonstrating is the following:

=MID(A1,MIN(FIND({1,2,3,4,5,6,7,8,9,0},A1&1/17)),1+MMULT(MATCH("*",T(1/(1+MID(A1,ROW(INDEX(A:A,1):INDEX(A:A,LEN(A1))),1))),{1,0}),{1;-1}))

But first, perhaps an example will help clarify the precise matter at hand, and let’s start with a quick refresher on how to approach the task of extracting a numerical string which does not contain any delimiters. Consider the following string in A1:

Account No. 1234567890: requires attention

from which our desired extraction is clearly 1234567890.

And let’s assume that the colon immediately following our numeric is not constant, hence ruling out any solutions which may seek to locate that character within the string.

There are several techniques we could use here, one example being:

=-LOOKUP(1,-(MID(A1,MIN(FIND({1,2,3,4,5,6,7,8,9,0},A1&1/17)),ROW(INDEX(A:A,1):INDEX(A:A,LEN(A1))))&"**0"))

Note the necessary catenation of the string "**0" to the end of the values produced by MID, another technique courtesy of Lori which is crucial (yet sadly lacking in practice) if we want to guarantee correct results .

For example, change the string in A1 to e.g.:

Account No. 12 – Jun: requires attention

and the version of the formula without this small amendment, i.e.:

=-LOOKUP(1,-(MID(A1,MIN(FIND({1,2,3,4,5,6,7,8,9,0},A1&1/17)),ROW(INDEX(A:A,1):INDEX(A:A,LEN(A1))))))

will return, not 12, but 42167, since the string “12 – Jun” will be interpreted, at the time of writing, as the (numeric) date entry 12/6/2015.

(Readers who copy and paste the string as given here into Excel may still see a result of 12 due to my editor rendering the spaces around the hyphen as non-breaking characters. Delete and re-type those spaces if so.)

Catenation of the string "**0" ensures that strings such as “12 – Jun” become "12 – Jun**0" and so can no longer be interpreted as numerics. Although not appropriate to this post, the same also applies to strings which happen to bear a resemblance to numerics formatted as scientific notation.

Of course, such a string must also possess the property that it leaves any numerics unchanged. And this is the point: the string "**0" is equivalent to "E0", i.e. represents scientific notation with an index of 0, and, since 10^0 is unity, we guarantee that any numbers represented in this form will be unchanged.

Readers wishing to test this hypothesis can easily do so by entering a few simple formulas into the worksheet, e.g.:

=0+(147&"**0")

returns 147, whereas:

=0+(147&"**2")

returns 14700, etc., etc.

Note that this choice of additional string is to be preferred over the equivalent "E0". Using:

=-LOOKUP(1,-(MID(A1,MIN(FIND({1,2,3,4,5,6,7,8,9,0},A1&1/17)),ROW(INDEX(A:A,1):INDEX(A:A,LEN(A1))))&"E0"))

on the same string as previously returns 36689, not 12, since the substring "12 – JunE0", by pure misfortune (“Jun” & “E” is still “June”!), happens to represent, when coerced to a numeric, the date 12/06/2000.

What’s more, as Lori points out, “the “E” might also be interpreted differently in different locales”.

Now let’s look at a similar case, but in which the characters of the desired extraction are not all numerical. For example:

Account No. 1-234-5678-90: requires attention

Using our formula as given above, i.e.:

=-LOOKUP(1,-(MID(A1,MIN(FIND({1,2,3,4,5,6,7,8,9,0},A1&1/17)),ROW(INDEX(A:A,1):INDEX(A:A,LEN(A1))))&"**0"))

we arrive at a result of 1, not 1-234-5678-90, as required.

And, as I mentioned, unfortunately it is not so simple a matter as to first remove any delimiters, viz e.g.:

=-LOOKUP(1,-(MID(SUBSTITUTE(A1,"-",""),MIN(FIND({1,2,3,4,5,6,7,8,9,0},A1&1/17)),ROW(INDEX(A:A,1):INDEX(A:A,LEN(A1))))&"**0"))

Although this seems at first sight to be leading somewhere, in that it returns 1234567890, how do we then decide where to re-insert our delimiters? Perhaps if the string in question were of a fixed format – as admittedly some telephone numbering systems are – then this might be a feasible approach. Even then, however, the matter of multiple REPLACE/SUBSTITUTE functions might not leave us with the most succinct of solutions.

I could go on, but the point of this post, as I mentioned at the outset, is not to debate whether the extraction for a particular string can be achieved using alternative constructions, tailored to the string in question. Instead, by determining the first and last numerics within that string, we guarantee ourselves a solution which will work in all cases, no matter the number or type of delimiter, and without the need to debate multiple replacements and/or substitutions.

We already have a more-than-suitable means with which to determine the position of the first numeric within our string, viz the MIN/FIND set-up used previously, which, in my opinion, is by far the most efficient of the available options.

Unfortunately, it is not so easy to invent an equivalent construction for determining the last number in a string, and so we must look to alternatives. Furthermore, as I alluded to in the title of this post with the word “simultaneous”, it would be useful – in fact, the crux of this solution – if the construction we choose can be used to provide us with not only the last number, but also – with minimal alteration – the first.

Or, put another way: what choices do we have for the construction to be used as MID’s num_chars parameter, viz:

=MID(A1,MIN(FIND({1,2,3,4,5,6,7,8,9,0},A1&1/17)),[some construction])

given that we would like to avoid, if possible, [some construction] consisting of a subtraction operating over two separate clauses, one for determining the position of the first numeric within our string, the other for determining the position of the last?

It might be useful if we first look at some constructions for determining the position of the last numeric within a string, and then see which, if any, of these might also lend themselves to being employed to find the position of the first numeric.

The three main candidates are:

=MATCH(1,0/ISNUMBER(0+MID(A1,ROW(INDEX(A:A,1):INDEX(A:A,LEN(A1))),1)))
=LOOKUP(1,0/ISNUMBER(0+MID(A1,ROW(INDEX(A:A,1):INDEX(A:A,LEN(A1))),1)),ROW(INDEX(A:A,1):INDEX(A:A,LEN(A1))))
=MAX(IF(ISNUMBER(0+MID(A1,ROW(INDEX(A:A,1):INDEX(A:A,LEN(A1))),1)),ROW(INDEX(A:A,1):INDEX(A:A,LEN(A1)))))

the first and last of those committed with CSE.

Obviously the first of those is preferable, since it does not require repetition of the ROW clause. But can such a construction also be used to find the first numeric within a string? If not, can the second? And the third?

Before tackling this question, let’s look at how this problem might be approached upon first inspection. The majority of initial attempts, I imagine, would consist of a subtraction comprising the result of our MIN/FIND clause for determining the position of the first numeric within the string and one of the above constructions for determining the last, viz:

MATCH(1,0/ISNUMBER(0+MID(A1,ROW(INDEX(A:A,1):INDEX(A:A,LEN(A1))),1)))-MIN(FIND({1,2,3,4,5,6,7,8,9,0},A1&1/17))

which would make our overall solution (with CSE):

=MID(A1,MIN(FIND({1,2,3,4,5,6,7,8,9,0},A1&1/17)),1+MATCH(1,0/ISNUMBER(0+MID(A1,ROW(INDEX(A:A,1):INDEX(A:A,LEN(A1))),1)))-MIN(FIND({1,2,3,4,5,6,7,8,9,0},A1&1/17)))

Not bad, but can we improve upon this construction for the num_chars parameter?

Well, one such means is via use of a function which I have not yet mentioned, and which consists of a very slight alteration to one of the three suggestions given above for finding the last numeric within a string, viz:

=MAX(IF(ISNUMBER(0+MID(A1,ROW(INDEX(A:A,1):INDEX(A:A,LEN(A1))),1)),ROW(INDEX(A:A,1):INDEX(A:A,LEN(A1)))))

It being 2015, it could be said that it has now been a sufficient number of years since the release of Excel 2010 that we can start safely recommending post-2007 functions without fear of a large number of Excel users not having access to those functions.

And one such 2010-and-later function is AGGREGATE, which not only offers us a non-CSE alternative to many CSE constructions, but also has some additional benefits that standard CSE formulas cannot replicate.

To clarify, the AGGREGATE-equivalent of the above CSE would be:

=AGGREGATE(14,6,ROW(INDEX(A:A,1):INDEX(A:A,LEN(A1)))/ISNUMBER(0+MID(A1,ROW(INDEX(A:A,1):INDEX(A:A,LEN(A1))),1)),1)

But we are not restricted to just a single value for the first parameter, and we can make use of this property to generate, not merely just the maximum or just the minimum, but an array which consists of both.

As such, we can construct:

=MMULT(AGGREGATE({14,15},6,ROW(INDEX(A:A,1):INDEX(A:A,LEN(A1)))/ISNUMBER(0+MID(A1,ROW(INDEX(A:A,1):INDEX(A:A,LEN(A1))),1)),1),{1;-1})

which resolves to:

=MMULT({25,13},{1;-1})

the 13 and 25 of course representing the positions of our first and last numerics within the string.

Not a bad construction, taking advantage of the ability to pass an array as AGGREGATE’s first argument in order to “simultaneously” obtain the positions of both the first and last numerics.

And yet it still suffers from the slight drawback that it requires the repetition of the ROW construction. Can we perhaps improve upon this AGGREGATE set-up, and find a construction which does not necessitate such a repeated clause?

The answer is yes, viz the following:

MATCH("*",T(1/(1+MID(A1,ROW(INDEX(A:A,1):INDEX(A:A,LEN(A1))),1))),{1,0})

How does it work?

The essence of the idea is similar to that involving AGGREGATE, in that we take a function and pass an array as one of its parameters, the result of one version of that formula giving us the position of the first numeric, the result of the other giving us the last numeric.

With AGGREGATE the logic was self-explanatory. After all, that function already has two clearly-defined parameters, i.e. 14 (SMALL) and 15 (LARGE), which lend themselves perfectly to our situation.

However, with this set-up, i.e. one employing MATCH with an array of values – 1 and 0 – for the match_type parameter, it is not at first sight evident how this could function to suit our needs.

The justification is as follows:

First, let’s insert our values from the MID clause, giving us:

MATCH("*",T(1/(1+{"A";"c";"c";"o";"u";"n";"t";" ";"N";"o";".";" ";"1";"-";"2";"3";"4";"-";"5";"6";"7";"8";"-";"9";"0";" ";"r";"e";"q";"u";"i";"r";"e";"s";" ";"a";"t";"t";"e";"n";"t";"i";"o";"n"})),{1,0})

We will be reciprocating each of the values within this array with unity. This will of course mean that any numerics within that array will remain as such, apart from zero which will result in a #DIV/0! error. Such an error will be problematic later on in the evaluation; hence the addition of 1 to each of the values within this array prior to reciprocation, which guarantees that all numerics within that array are non-zero.

We thus have, after performing this addition of 1 to our values:

MATCH("*",T(1/({#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;2;#VALUE!;3;4;5;#VALUE!;6;7;8;9;#VALUE!;10;1;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!})),{1,0})

And, after reciprocation:

MATCH("*",T({#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;0.5;#VALUE!;0.333333333333333;0.25;0.2;#VALUE!;0.166666666666667;0.142857142857143;0.125;0.111111111111111;#VALUE!;0.1;1;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!}),{1,0})

We now apply the function T to these values, giving:

MATCH("*",{#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;"";#VALUE!;"";"";"";#VALUE!;"";"";"";"";#VALUE!;"";"";#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!},{1,0})

And perhaps now the logic behind this construction is evident. By first reciprocating the values in our array with unity, we ensure that the only values within that array which do not resolve to an error are those which are numerical.

Then, by applying a suitable function to this new array, we can render all those numericals identical, the resulting array being of a form such that we can pass it to MATCH, using not just one but two different values as that function’s match_type parameter, thus simultaneously generating both of our required values.

A match_type parameter of 1 will give us the position of the last non-error entry within the array; a parameter of 0 will give us the first. Such that the above becomes:

{25,13}

as required.

As previously, we may now pass this array to MMULT to derive, finally, the necessary value for MID’s num_chars parameter.

And so our overall formula is:

=MID(A1,MIN(FIND({1,2,3,4,5,6,7,8,9,0},A1&1/17)),1+MMULT(MATCH("*",T(1/(1+MID(A1,ROW(INDEX(A:A,1):INDEX(A:A,LEN(A1))),1))),{1,0}),{1;-1}))

as required.

Another post to follow shortly. Watch this space!

23 comments

  1. Hello XOR LX

    AGGREGATE({14,15},…)

    Ah, when did you stumble on this construction? I mentioned it a few days back:

    I searched and couldn’t find if anyone else has reported or used this before.

  2. @XOR LX
    How about

    =MATCH(1,IF(ISNUMBER(--MID(A1,ROW(INDIRECT("1:"&LEN(A1))),1)),1),{1,-1})

    returns the same array pair {25,13}

  3. @XLarium

    Not quite sure I understand what you’re suggesting there?

    To answer your question, I’ve been using that construction for a few years now. If you want “proof”, there’s this from 14 months ago:

    http://www.mrexcel.com/forum/excel-questions/779701-isolating-numeric-values-character-string.html

    This from 12 months ago:

    http://stackoverflow.com/questions/24087376/why-dont-n-a-and-max-or-min-play-well-with-each-other-and-what-can-i-do-abut

    This from 9 months ago:

    http://www.excelforum.com/excel-formulas-and-functions/1059440-compute-mtbf-by-locating-value-in-rows-above.html

    This from 3 months ago:

    http://www.mrexcel.com/forum/excel-questions/851155-making-unique-values-reducing-into-one-cell.html

    and many more.

    Regards

  4. @XOR LX

    I’m just curios if I was the first since I couldn’t find any prior reference.

    How did you find out?

    In my case I had two separate formulas with the same structure and I just tried this {14.15} and – to my surprise – it worked.

  5. @sam

    Lovely idea, though the results of searching for exact values using an inexact match_type are not always what we would hope.

    For example, what do you get if we change the entry in A1 very slightly to:

    Account No. 1-234-5678-9: requires attention

    i.e. remove the zero?

    Regards

  6. @XLarium

    How did I find out what? That AGGREGATE’s first argument can accept an array?

    But this is almost always the case, no? Why shouldn’t it? After all, we can construct such things as:

    =SUM(0+MID(A1,{1,2,3},{1;2}))

    or:

    =MAX(LOOKUP({1,2},A1:A10,B1:B10))

    The MATCH construction given in this post is another good example.

    In fact, to my knowledge there are very few arguments which do not accept arrays. So I think it wasn’t long at all before I was using AGGREGATE in such a way.

    Regards

  7. @XOR LX

    It looks like I have to try out more function array arguments.

    Just some random thoughts …

    INDIRECT({"C5";"R6C3";"C7";"C8"},{TRUE;FALSE;TRUE;TRUE})
    INDIRECT({"C5";"R6C3";"C7";"C8"},TRUE)
    INDIRECT("C5",{TRUE;FALSE;TRUE;TRUE})
    INDIRECT("C5",{TRUE;TRUE;TRUE})

    is equivalent to

    INDEX(C5:C5,N(IF(1,{1;1;1})))

    or

    =OFFSET(C5,{0;0;0},)

    Which returns the value of C5 three times.

  8. @XOR LX
    For

    Account No. 1-234-5678-9: requires attention

    returns the array pair

    {22,13}

    which is the correct answer no ?

  9. @sam

    The 22nd character in that string is 8, not 9, so if you use that array to generate MID’s num_chars parameter, your final extraction will be:

    1-234-5678

    and not:

    1-234-5678-9

    as required.

    Obviously the correct array pair should be:

    {24,13}

    for this string.

    Regards

  10. @XOR XL

    Try

    =MATCH({2,1},IF(ISNUMBER(--MID(A1,ROW(INDIRECT("1:"&LEN(A1))),1)),1),{1,-1})

    returns

    {24,13}

    Regards

    Sam

  11. @XOR LX
    Thanks ! for your kind words

    – Your MATCH(“*”,……) construct reminded me of a small tip I learnt from another Genius like you. His name is Daniel Ferry

    Try Omega it is faster than “*”

    Lets assume in column A you have mixed Data types : Numbers + Text + BLANKS and you need to refer to the last non blank row

    Then you can use LastRow

    =MAX(MATCH(Tokens,A:A))

    Where Tokens

    =CHOOSE({1,2},"Ω",9.9E+307)

    This can then be combined with Index to give us a Dynamic name that works with blanks + mixed data types

    =A1:INDEX(A:A,LastRow)
  12. Also realized that the ISNUMBER check is redundant so the High-Low Binary search can be further reduced to

    =MATCH({2,1},IF(--MID(A1,ROW(INDIRECT("1:"&LEN(A1))),1),1),{1,-1})

    The Binary search options of Match are blind to this error value and see only an array’s of 1 so always see “pseudo sorted” data

  13. @sam

    Indeed! And if we really want we can do away with the IF statement:

    =MATCH({2,1},MID(A1,ROW(INDIRECT("1:"&LEN(A1))),1)^0,{1,-1})

    though I would also say that, and as you have probably gathered from my posts, I prefer not to use the “fully volatile” INDIRECT for the ROW construction.

    Re your previous comment, why precisely is the construction with Omega faster?

    I have used that construction many times myself for generating dynamic ranges. One question: how would you amend it so that it also handles (i.e. ignores) any null strings ("") at the end of the range?

    This is of course straightforward if we adopt a different approach for determining the last non-empty cell within a range of mixed datatypes, i.e. one which passes an array of cells to some comparison. But the nice thing about the MATCH approach is that it does not necessitate calculating over all cells passed to it, and so is much, much quicker.

    Can you think of a way to adapt the set-up you give to also handle (ignore) any null strings at the end of the range?

    Regards

  14. @XOR LX

    No idea why the “Ω” is faster – Was surprised when Daniel first pointed it out to me – I did test it to be sure.

    If we assume that Null Strings will always be present at the end of the range (of mixed data types) and to ignore them

    LastRow

    =MEDIAN(MATCH(Tokens,A:A,{1,-1,1})

    Where Tokens

    =CHOOSE({1,2,3},"Ω","*",9.9E+307)
  15. @sam

    Have you thoroughly tested your set-up?

    MEDIAN does not accept error values, so your current set-up requires that there be at least one numeric and at least one numeric at all times.

    In any case, apologies for not being clear, but of course we want a solution which works when null strings are present in any cell(s) within the range, not just those at the very end.

    Regards

  16. @XOR LX

    Is not going to be very fast but the below works fine (array entered)

    =MATCH(2,1/LEN(A:A))

    Can handle null strings any where in the range + can work with just text or just numbers in the range + mixed data types

  17. @sam

    Ah, yes, but of course there are many ways in which we can determine this value if we allow ourselves a set-up such as that. This is precisely what I meant when I said:

    This is of course straightforward if we adopt a different approach for determining the last non-empty cell within a range of mixed datatypes, i.e. one which passes an array of cells to some comparison. But the nice thing about the MATCH approach is that it does not necessitate calculating over all cells passed to it, and so is much, much quicker.

    I was more wondering whether you could think of a way which avoids such array-processing, resource-heavy constructions. Or does this seemingly insignificant difference in considering null strings destroy all hope of achieving such a solution?

    Regards

  18. Here is how you would do the sample extractions in Google Sheets

    Account No. 1234567890: requires attention

    =REGEXEXTRACT(A1,"\d+")

    Account No. 1-234-5678-90: requires attention

    =REGEXEXTRACT(A1,"[\d-]+")
  19. @XOR LX

    For some reason this comment did not get posted

    The Null Strings do create a problem. However the best that we can do is to improve the performance by making 1/LEN(A:A) look at a smaller Range like so

    define LstRow

    =MATCH("Ω",$A:$A)

    define Rng

    =$A$1:INDEX($A:$A,LstRow)

    define LstRowExclNull

    =MATCH(2,1/LEN(Rng)

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s