In a database containing firm and patent class values, I want to calculate the following variable:

Technological abandonment: Number of previously active technological patent classes abandoned annually.

Specifically, I want to create variables that count the patent classes (variable = class, called category in the sample data below) that the firm used in the past 3 years (t-3, t-2, and t-1) but that are missing in the current year (t). If the firm's history does not yet span 3 years, a minimum of one prior year is acceptable. I would like to do the same with a 5-year window as well.

I have a dataset containing millions of rows, so a fast data.table solution is much preferred.

In the following dataset:

df <- data.table(year=c(1979,1979,1980,1980,1981,1981,1982,1983,1983,1984,1984),
                 category = c("A","A","B","C","A","D","F","F","C","A","B"))

The desired outcome would be (for a three year window):

    year        class tech_aband_3
 1: 1979        A     0
 2: 1979        A     0
 3: 1980        B     1
 4: 1980        C     1
 5: 1981        A     2
 6: 1981        D     2
 7: 1982        F     4
 8: 1983        F     3
 9: 1983        C     3
10: 1984        A     3
11: 1984        B     3

Many thanks in advance.

  • I do not see (id = gvkey) in the sample data. Please update. Commented Mar 9 at 10:37
  • Thanks for the catch, just updated it. I just meant to clarify I'll need to run it by firm id in the end. Commented Mar 9 at 21:40
  • I'm not clear on why the question was closed. It has been clear enough to draw high quality answers. If the one-sentence reference to the group id was the issue, it is removed. I am respectfully asking the question to be reopened again. Commented Mar 13 at 14:21

2 Answers

Assuming that all years are represented in the data (if not, you'd need to fill missing years for the following to work), you can try:

library(data.table)  
  
df[, .(category = list(unique(category))), by = year
   ][, tech_aband_3 := lengths(mapply(\(x, y) setdiff(unlist(x), y), 
                                      transpose(shift(list(category), 1:3, fill = first(category[[1]]))), 
                                      category))
     ][, .(category = unlist(category)), by = .(year, tech_aband_3)
       ][ df, on = .(year, category)
       ]

     year tech_aband_3 category
    <num>        <int>   <char>
 1:  1979            0        A
 2:  1979            0        A
 3:  1980            1        B
 4:  1980            1        C
 5:  1981            2        A
 6:  1981            2        D
 7:  1982            4        F
 8:  1983            3        F
 9:  1983            3        C
10:  1984            3        A
11:  1984            3        B
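
The chain above counts back by rows, which is why it needs every year to be present. As a rough sketch of an alternative that works on calendar-year arithmetic instead, the helper below marks each (year, category) pair as "previously active" for the following `window` years and then anti-joins against the pairs that are actually active. The function name `tech_aband`, its `window` and `group_cols` arguments, and the idea of passing a firm id column are assumptions made for illustration; they are not part of the answer above.

```r
library(data.table)

df <- data.table(year = c(1979,1979,1980,1980,1981,1981,1982,1983,1983,1984,1984),
                 category = c("A","A","B","C","A","D","F","F","C","A","B"))

# Hypothetical helper (not from the answer above): for each row of dt, count the
# categories active in the previous `window` calendar years but absent this year.
# `group_cols` optionally names id columns (e.g. a firm id) to count within.
tech_aband <- function(dt, window, group_cols = character()) {
  keys <- c(group_cols, "year", "category")
  active <- unique(dt[, ..keys])

  # A category active in year y is "previously active" for years y+1 .. y+window.
  prev <- active[rep(seq_len(nrow(active)), each = window)]
  prev[, year := year + rep(seq_len(window), times = nrow(active))]
  prev <- unique(prev)

  # Anti-join: drop categories that are still active in the target year.
  dropped <- prev[!active, on = keys]
  counts <- dropped[, .(n = .N), by = c(group_cols, "year")]

  # Look up the count for every row of dt; no match means nothing was abandoned.
  n <- counts[dt, on = c(group_cols, "year"), x.n]
  n[is.na(n)] <- 0L
  n
}

df[, tech_aband_3 := tech_aband(df, 3)]
df[, tech_aband_5 := tech_aband(df, 5)]
df
```

On the sample data the 3-year column should match the desired output in the question. With a firm identifier in the data, something like `tech_aband(dt, 3, group_cols = "firm")` would count within each firm, where `firm` stands in for whatever the id column is actually called. The expansion step creates up to `window` rows per unique (year, category) pair, so memory use grows with the window width rather than with the raw row count.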

Here is a method that works with your example data. I can't say how fast it will be with a large data set.

library(data.table)
library(purrr)
df <- data.table(year=c(1979,1979,1980,1980,1981,1981,1982,1983,1983,1984,1984),
                 category = c("A","A","B","C","A","D","F","F","C","A","B"))

GetCount <- function(CurrYear) {
  Prev <- unique(df[(CurrYear - year) <= 3 & (CurrYear - year) > 0, "category"])
  Current <- unique(df[year == CurrYear, "category"])
  return(nrow(Prev[!Current, on = "category"]))
}

YEARS <- unique(df$year)       
COUNTS <- map_dbl(YEARS, GetCount)
YearsCounts <- data.table(year = YEARS, tech_aband_3 = COUNTS)

FINAL <- YearsCounts[df, on = "year"]
FINAL
     year tech_aband_3 category
    <num>        <num>   <char>
 1:  1979            0        A
 2:  1979            0        A
 3:  1980            1        B
 4:  1980            1        C
 5:  1981            2        A
 6:  1981            2        D
 7:  1982            4        F
 8:  1983            3        F
 9:  1983            3        C
10:  1984            3        A
11:  1984            3        B
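
The question also asks for a 5-year window. One way to get it from the same loop, sketched here under the assumption that only the window width needs to change, is to pass the width as an argument. `GetCountWindow` and its `window` parameter are illustrative names rather than part of the code above, and base `vapply` stands in for `map_dbl` so that purrr is not required.

```r
library(data.table)

df <- data.table(year = c(1979,1979,1980,1980,1981,1981,1982,1983,1983,1984,1984),
                 category = c("A","A","B","C","A","D","F","F","C","A","B"))

# Same logic as GetCount above, with the window width passed in
# (the `window` argument is an addition for illustration).
GetCountWindow <- function(CurrYear, window) {
  # Categories seen in the `window` years before CurrYear
  Prev <- unique(df[(CurrYear - year) <= window & (CurrYear - year) > 0, "category"])
  # Categories seen in CurrYear itself
  Current <- unique(df[year == CurrYear, "category"])
  # Anti-join: previous categories that are not current
  nrow(Prev[!Current, on = "category"])
}

YEARS <- unique(df$year)
YearsCounts <- data.table(
  year = YEARS,
  tech_aband_3 = vapply(YEARS, GetCountWindow, integer(1), window = 3),
  tech_aband_5 = vapply(YEARS, GetCountWindow, integer(1), window = 5)
)

FINAL <- YearsCounts[df, on = "year"]
FINAL
```

Because each call scans the whole table, the cost grows with the number of distinct years times the number of rows, which may matter at the scale mentioned in the question.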
