Collating from multiple sheets based on conditions 6

Some of us may be familiar with the standard technique using INDEX, SMALL, etc. which, given a single-column or single-row array, we can use to return a list of only those values which satisfy one or more criteria of our choosing.

In a previous post (see here) I outlined a method which, given a range consisting of more than one column, returned a single column consisting of all non-blank entries from that range. It can easily be verified that the single condition within this formula (i.e. that the entry be non-blank) can be extended to multiple criteria and so, effectively, we now have at our disposable the means with which to generate single-column lists from both one- and two-dimensional arrays.

But can we go one further yet again? “Three-dimensional” is the collective term often applied to those formulas in Excel which are capable of operating over not just single columns or rows, nor yet ranges consisting of multiple columns or rows (two-dimensional), but which also function effectively over multiple worksheets.

The set-up laid out here presents a means by which, given one or more worksheets of identical layout, we can create a further, “master” worksheet consisting of only those entries from across all worksheets which satisfy certain criteria.

With VBA, this is a run-of-the-mill task, no more than a handful of lines of code. But can we do it using worksheet formulas alone? The answer is yes, as follows:

Assuming we have the following three sheets from which we wish to collate our data:

Sheet1:

Sheet1

Sheet2:

Sheet2

Sheet3:

Sheet3

we will create the following Master sheet, consisting only of entries from the other three sheets whose value in column D is “Y”:

Master

(You can download the workbook here.)

First we go to Name Manager and define the following:

Name: Sheets
Refers to:

={"Sheet1","Sheet2","Sheet3"}

Name: Arry1
Refers to:

=MMULT(0+(ROW(INDIRECT("1:"&COUNTA(Sheets)))>=TRANSPOSE(ROW(INDIRECT("1:"&COUNTA(Sheets))))),TRANSPOSE(COUNTIF(INDIRECT("'"&Sheets&"'!D2:D10"),"Y")))

(Clearly we would amend both the worksheet list and range (D2:D10) appropriately if required.)

The formula in G1, used to determine the number of expected returns for our Master sheet, is:

=SUMPRODUCT(COUNTIF(INDIRECT("'"&Sheets&"'!D2:D10"),"Y"))

The array formula in A2 is then (edit (23/11/2015): many thanks to Dave Dempsey who spotted an error in the previous formula given here):

=IF(ROWS($1:1)>$G$1,"",INDEX(INDIRECT("'"&INDEX(Sheets,MATCH(TRUE,Arry1>=ROWS($1:1),0))&"'!A2:F10"),SMALL(IF(INDIRECT("'"&INDEX(Sheets,MATCH(TRUE,Arry1>=ROWS($1:1),0))&"'!D2:D10")="Y",ROW(INDIRECT("1:"&ROWS($A$2:$F$10)))),IFERROR(1+ROWS($1:1)-LOOKUP(ROWS($1:1),1+Arry1),ROWS($1:1))),COLUMNS($A:A)))

We can then copy this across and down as required.

Note that, if the values in column A (Code) across all sheets were in fact unique (or else we had some other column which contained values which had that property), then we would not necessarily have to extend this set-up to every column in our Master sheet. We could use it exclusively to obtain the column A values and then employ a different, less resource-heavy method to obtain the entries in the other columns.

Readers are advised – if they know of this alternative solution, that is – to use that where possible. I will nevertheless use the technique over all columns here as a means of demonstration. Besides, we may of course find ourselves in rare situations where we cannot guarantee the generation of unique identifiers across our worksheets, and so it may be necessary to employ this construction across all columns.

How does it work?

Let’s deal with the relatively straightforward formula in G1 first, used to determine the number of expected returns, which I will give again here:

=SUMPRODUCT(COUNTIF(INDIRECT("'"&Sheets&"'!D2:D10"),"Y"))

In case any readers are unfamiliar with formula techniques for working across multiple worksheets, they should make note of this construction using INDIRECT, as it offers practically the only means by which we can perform such calculations. Resolved, it becomes:

=SUMPRODUCT(COUNTIF(INDIRECT("'"&{"Sheet1","Sheet2","Sheet3"}&"'!D2:D10"),"Y"))

This array of text strings representing our sheet names is then concatenated at both ends, at the back with a syntactically-correct construction representing the range in question (D2:D10) and at the front with a single apostrophe. Although this latter is not strictly necessary should the names of the relevant worksheets contain no spaces, it is nevertheless good practice to form the construction so.

Hence, after these concatenations, we will have:

=SUMPRODUCT(COUNTIF(INDIRECT({"'Sheet1'!D2:D10","'Sheet2'!D2:D10","'Sheet3'!D2:D10"}),"Y"))

and, by virtue of the two facts that COUNTIF is able to operate over 3-D ranges and that SUMPRODUCT is providing the necessary coercion such that INDIRECT will return an array of range references, not just one, this resolves to:

=SUMPRODUCT({3,2,1})

the elements of this array – 3, 2 and 1 – corresponding to the number of entries in D2:D10 containing a “Y” in Sheet1, Sheet2 and Sheet3 respectively. And so this finally resolves to 6, as required.

So let’s now look at the main formula in A2, i.e.:

=IF(ROWS($1:1)>$G$1,"",INDEX(INDIRECT("'"&INDEX(Sheets,MATCH(TRUE,Arry1>=ROWS($1:1),0))&"'!A2:F10"),SMALL(IF(INDIRECT("'"&INDEX(Sheets,MATCH(TRUE,Arry1>=ROWS($1:1),0))&"'!D2:D10")="Y",ROW(INDIRECT("1:"&$G$1))),IFERROR(1+ROWS($1:1)-LOOKUP(ROWS($1:1),1+Arry1),ROWS($1:1))),COLUMNS($A:A)))

The first IF clause is relatively straightforward, and the reason we went to the effort of having that formula in G1: for formulas beyond the 6th row to which we copy this formula, the IF statement will return TRUE and so the FALSE clause, a rather large and resource-heavy construction, will not even be calculated.

This set-up has obvious advantages over the alternative IFERROR approach, which is often seen, in which Excel is still forced to calculate the array formula even in rows beyond the expected number of returns. Without knowing it, people using this approach are forcing Excel to do a good deal more work than is necessary.

The first part of the main clause that I would like to break down is Arry1, which recall was saved as a Defined Name and whose formula is:

=MMULT(0+(ROW(INDIRECT("1:"&COUNTA(Sheets)))>=TRANSPOSE(ROW(INDIRECT("1:"&COUNTA(Sheets))))),TRANSPOSE(COUNTIF(INDIRECT("'"&Sheets&"'!D2:D10"),"Y")))

This is a construction I like to use which effectively generates running subtotals, and is an alternative to the more standard method which uses a combination of SUBTOTAL and OFFSET.

Let’s look at each of the arrays being passed to MMULT in turn. The first one:

0+(ROW(INDIRECT("1:"&COUNTA(Sheets)))>=TRANSPOSE(ROW(INDIRECT("1:"&COUNTA(Sheets)))))

will resolve to, since there are 3 sheets in our Named Range Sheets:

0+(ROW(INDIRECT("1:"&3))>=TRANSPOSE(ROW(INDIRECT("1:"&3))))

which is:

0+({1;2;3}>=TRANSPOSE({1;2;3}))

i.e.:

0+({1;2;3}>={1,2,3})

Readers unfamiliar with the means by which Excel resolves cases in which the two arrays being processed are orthogonal – i.e. one is a single-column array, the other a single-row array – are advised to go through some of my previous posts, e.g. here, in which I give more detailed explanations on this subject.

Suffice to say that the resolution of this construction, comprising a 3-row-by-1-column array and a 1-row-by-3-column array, will be a 3-row-by-3-column array the entries of which will consist of the 9 Boolean TRUE/FALSE responses to the equality query, i.e.:

0+({TRUE,FALSE,FALSE;TRUE,TRUE,FALSE;TRUE,TRUE,TRUE})

Adding the zero clarifies things somewhat:

{1,0,0;1,1,0;1,1,1}

So that represents the first array being passed to MMULT. What about the second? This is derived from:

TRANSPOSE(COUNTIF(INDIRECT("'"&Sheets&"'!D2:D10"),"Y"))

As we know from our earlier dissection, this will resolve to:

TRANSPOSE({3,2,1}

i.e. {3;2;1}. Hence, our MMULT construction now looks like:

MMULT({1,0,0;1,1,0;1,1,1},{3;2;1})

which is (and here readers will see the reason for my reference to running subtotals) {3;5;6}.

We can now substitute this value for Arry1 into the following clause, which appears twice within our main formula:

INDIRECT("'"&INDEX(Sheets,MATCH(TRUE,Arry1>=ROWS($1:1),0))&"'!A2:F10")

This will become (for the formula in A2, that is: bear in mind that the ROWS($1:1) part, which equals 1, will become, successively, 2, 3, 4, etc. as the formula is copied down to subsequent rows):

INDIRECT("'"&INDEX(Sheets,MATCH(TRUE,{3;5;6}>=1,0))&"'!A2:F10")

and, resolving the MATCH function:

INDIRECT("'"&INDEX(Sheets,1)&"'!A2:F10")

i.e.:

INDIRECT("'"&"Sheet1"&"'!A2:F10")

which is of course:

INDIRECT("'Sheet1'!A2:F10")

And so you can see that, for the formula in A2, we know that we will be indexing Sheet1 for our first return. If we were in fact deconstructing, for example, the formula in A5, not A2, then our MATCH construction would look like:

MATCH(TRUE,Arry1>=ROWS($1:4),0)

the only thing having changed being that we are now referencing ROWS($1:4), not ROWS($1:1). Of course, this would now resolve to:

MATCH(TRUE,{3;5;6}>=4,0)

and, since this is clearly 2, our above construction would this time resolve to an indirect reference to Sheet2, not Sheet1.

This is in essence the crux of this technique: by forming an array of running subtotals consisting of the number of rows which match our criteria (i.e. have an entry of “Y” in column D) from each of the sheets in question, we then compare the relative row number of the cell in which our formula resides to this array so that, effectively, we determine which sheet we are going to be indexing for the formula in any given row.

Let’s now put those parts we have thus far calculated back into the main clause of the formula, so that:

INDEX(INDIRECT("'"&INDEX(Sheets,MATCH(TRUE,Arry1>=ROWS($1:1),0))&"'!A2:F10"),SMALL(IF(INDIRECT("'"&INDEX(Sheets,MATCH(TRUE,Arry1>=ROWS($1:1),0))&"'!D2:D10")="Y",ROW(INDIRECT("1:"&$G$1))),IFERROR(1+ROWS($1:1)-LOOKUP(ROWS($1:1),1+Arry1),ROWS($1:1))),COLUMNS($A:A))

now looks like:

INDEX(Sheet1!A2:F10,SMALL(IF(Sheet1!D2:D10="Y",ROW(INDIRECT("1:"&$G$1))),IFERROR(1+ROWS($1:1)-LOOKUP(ROWS($1:1),1+Arry1),ROWS($1:1))),COLUMNS($A:A))

I have “resolved” the INDIRECT references here for the purpose of clarity. Technically-speaking, and due to Excel’s peculiarities with regards to this function, if you were to actually follow the steps of this construction using Evaluate Formula, you would notice that e.g.:

INDIRECT("'Sheet1'!D2:D10")

is not actually resolved to the equivalent range reference, i.e. Sheet1!D2:D10, but instead resolves to an array consisting of the individual elements within that range, i.e. {“Y”;0;”Y”;”Y”;0;0;0;0;0} (where the zeroes here could be either genuine zeroes in that range or blanks – this in itself could pose problems in other situations).

This rather idiosyncratic behaviour of INDIRECT is something which readers may wish to investigate further, though I will not dwell on it too much here, suffice to say that it should be borne in mind that my above “resolution” of the INDIRECT constructions to pure range references is not strictly correct.

Looking at the clause which generates the array to be passed to SMALL, i.e.:

IF(Sheet1!D2:D10="Y",ROW(INDIRECT("1:"&G$1)))

we see that this resolves to:

IF({"Y";0;"Y";"Y";0;0;0;0;0}="Y",ROW(INDIRECT("1:"&6)))

which is:

IF({TRUE;FALSE;TRUE;TRUE;FALSE;FALSE;FALSE;FALSE;FALSE},{1;2;3;4;5;6})

i.e.:

{1;FALSE;3;4;FALSE;FALSE;FALSE;FALSE;FALSE}

And here we can see that the non-FALSE entries of 1, 3 and 4 correspond to the relative row numbers from Sheet1 for which the entry in column D is “Y”.

All that is now left to do is to pass this array to SMALL and determine the necessary k parameter. This is actually the most difficult part of the whole set-up, since, unlike when we employ this set-up to an array consisting of a single column, for example, in which case we are able to simply increment the k parameter by 1 for successive rows, here we have to take account of the fact that this parameter will be “reset” to 1 when we get to the Sheet2 returns, and also again when we get to the Sheet3 returns.

The construction which I used for this purpose is the rather non-standard-looking:

IFERROR(1+ROWS($1:1)-LOOKUP(ROWS($1:1),1+Arry1),ROWS($1:1))

To see how this part functions, let’s temporarily fashion an array version of this construction, so that we are seeing how it resolves for not just the formula in A2, but for each of the first six rows to which we are copying this formula, i.e. A2:A7 (we know in any case that the formula will return blanks beyond the sixth row). We can see that the one variable in this construction (the ROWS part) will take a range of values from 1 to 6 (i.e. ROWS($1:1) to ROWS($1:6)).

Effectively then, we will be generating the array resulting from:

IFERROR(1+{1,2,3,4,5,6}-LOOKUP({1,2,3,4,5,6},1+Arry1),{1,2,3,4,5,6})

which resolves to:

IFERROR(1+{1,2,3,4,5,6}-LOOKUP({1,2,3,4,5,6},1+{3;5;6}),{1,2,3,4,5,6})

which is:

IFERROR(1+{1,2,3,4,5,6}-{#N/A,#N/A,#N/A,4,4,6},{1,2,3,4,5,6})

i.e.:

=IFERROR({#N/A,#N/A,#N/A,1,2,1},{1,2,3,4,5,6})

which finally gives us the array {1,2,3,1,2,1}, which, as you can see, is precisely the set of values for the k parameter which we require, i.e. the 1st, 2nd and 3rd “smallest” rows which match our criteria from Sheet1, the 1st and 2nd “smallest” rows which match our criteria from Sheet2 and finally the 1st “smallest” row which matches our criteria from Sheet3,

We are now in a position to put all the parts back together so that the main clause of our formula in A2:

INDEX(INDIRECT("'"&INDEX(Sheets,MATCH(TRUE,Arry1>=ROWS($1:1),0))&"'!A2:F10"),SMALL(IF(INDIRECT("'"&INDEX(Sheets,MATCH(TRUE,Arry1>=ROWS($1:1),0))&"'!D2:D10")="Y",ROW(INDIRECT("1:"&$G$1))),IFERROR(1+ROWS($1:1)-LOOKUP(ROWS($1:1),1+Arry1),ROWS($1:1))),COLUMNS($A:A))

becomes:

INDEX(Sheet1!A2:F10,SMALL({1;FALSE;3;4;FALSE;FALSE;FALSE;FALSE;FALSE},1),COLUMNS($A:A))

which is:

INDEX(Sheet1!A2:F10,1,COLUMNS($A:A))

and the COLUMNS($A:A) part is simply there so that, as the formula is dragged across to successive columns, creates the appropriate column_num value for the INDEX function. Here COLUMNS($A:A) is simply equal to 1 and so we have, finally:

INDEX(Sheet1!A2:F10,1,1)

as required.

6 comments

  1. I can make very good use of this formula, so I’m grateful for the post. However, I discovered that I can make the formula fail with an innocent change in the first non-Master sheet (Sheet1 in the example provided in the post).

    Specifically, in the example spreadsheet provided, I can make the formula produce three rows of “#N/A” entries in the Master worksheet if I modify the contents of the first worksheet (Sheet1) as follows:

    Insert three rows of data with no “Y” in column D, anywhere before the last row (which contains a “Y” in column D), so that (a) there are now more rows of data in Sheet1 (namely, 7) than there are rows containing “Y” in the three worksheets combined (that is, 6); and (b) there is now a row of data in Sheet1 beyond the 6th row of data that contains a “Y” in column D.

    In this case, the cells in the first three rows in the Master worksheet, which should contain the three rows of data with “Y” in column D from Sheet1, all contain “#N/A” entries. (The entries from Sheet2 and Sheet3 in the Master worksheet are fine.)

    More generally, I get #N/A errors in each column of the first m rows of the Master worksheet, where m is the number of rows of data in Sheet1 that meet the specified condition (e.g., “Y” in column D), when there is at least one row of data in Sheet1 that (a) meets the specified condition, and (b) lies beyond the nth row of data, where n is the total number of rows of data in all of the worksheets that contains an entry that meets the specified condition.

    (I’m using Excel 2011 on a Mac running El Capitan.)

  2. It looks like I can further generalize the condition where the formula fails, as follows:

    Let n = the total number of rows of data in all of the non-Master worksheets that contain data that meet the specified condition. (In the example posted, the specified condition is that there be a “Y” in column D. There are three rows of data in Sheet1 with “Y” in column D, two in Sheet2, and one in Sheet3, for a total of 6, so n = 6 in that example.)

    Let i_ws be the ith row of data (that is, not counting the row of headers) in worksheet “ws” that meets the specified condition.

    If i_ws > n for any i_ws, then all of the rows from worksheet “ws” in the Master worksheet will have #N/A entries in each column.

    To modify the example spreadsheet to meet the failure condition, do any of the following:
    (1) In Sheet1, insert three rows of data without a “Y” in column D anywhere before the last of the existing rows that has a “Y” in column D.
    (2) In Sheet2, do the same.
    (3) In Sheet3, insert six rows of data without a “Y” in column D before the first row of data.

    In each case, the Master worksheet will have rows of “#N/A” instead of the rows of data from the altered worksheet that contain “Y” in column D, while the data in the Master worksheet from the other, unmodified worksheets are included successfully.

  3. @Dave Dempsey

    Firstly, welcome. Secondly, excellent, lucid post and thirdly, good spot – thanks.

    The value in G1 should not be referenced in the portion:

    ROW(INDIRECT("1:"&$G$1))

    which should of course be:

    ROW(INDIRECT("1:"&ROWS($A$2:$F$10)))

    Of course, just as with the other portions of the formula which contain a reference (either direct or indirect) to the number of rows being queried, the upper row reference (10 here) would need amending as and when required.

    Have corrected in the actual post also.

    Many thanks once again.

    Regards

  4. Modify to just pull rows if cells in F10:F100 are blank. Can I do this by just changing values in the formulas and names? Or would it have to be reworked somehow?

    Thanks!

  5. Hi Matt.

    Yes, you should be able to do precisely that, amending the ranges as appropriate and also the criterion (i.e. replacing “Y” with “” in your case).

    Let me know if you have any problems.

    Regards

  6. Pingback: #Excel Super Links #18 – shared by David Hager | Excel For You

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s