Myroslav Gusyak

Calculation intervals using R

It is often necessary to analyze data dependent of different group variables. For example, we have the database of IP addresses with the time appeal:

Time IP
06.02.2017 1:50 126.152.136.197
06.02.2017 1:50 126.166.48.245
06.02.2017 1:45 126.255.138.102
06.02.2017 1:27 126.255.138.102
06.02.2017 1:25 126.166.48.245
06.02.2017 0:51 182.251.247.6

We need to find the time difference between same IP addresses. In the end result we need to get a table:

Time difference Number of IP
less than 1 minute 1256
from 1 to 3 minutes 856
from 3 to 30 minutes 723
from 30 to 60 minutes 995
from 1 to 6 hours 12
more than 6 hours 8

Algorithm for this task: find the difference between the same IP, remove unique or first IP addresses, find minimal interval inside same addresses, prepare time groups and build frequency table.

# Loading required packages
  require(readxl)
  require(data.table)
  require(plyr)

# Read data
  ip         <- read_excel('ip.xlsx')
  ip$tmstamp <- strptime(ip$tmstamp, format = '%Y-%m-%d %H:%M:%S')
  df         <- data.table(value = ip$tmstamp, group = ip$`click IP`)

# Find difference between time IP-address in sec
  setkey(df, group)
  df <- df[, diff := c(NA, diff(value)), by = group]

# Convert to minutes
  df$interval <- floor((-1) * df$diff / 60)

# Remove NA (first or unique IP)
  df <- df[complete.cases(df),]

# Find min and max of tha same IP and prepare groups of time intervals
  df_min             <- df[ , min(interval), by = group]
  df_min$gr_interval <- cut(df_min$V1, breaks = c(-Inf, 1, 2, 29, 59, 359, Inf), labels = c('less than 1 minute', 'from 1 to 3 minutes', 'from 3 to 30 minutes', 'from 30 to 60 minutes', 'from 1 to 6 hours', 'more than 6 hours'), right = FALSE)
  df_max             <- df[ , max(interval), by = group]
  df_max$gr_interval <- cut(df_max$V1, breaks = c(-Inf, 1, 2, 29, 59, 359, Inf), labels = c('less than 1 minute', 'from 1 to 3 minutes', 'from 3 to 30 minutes', 'from 30 to 60 minutes', 'from 1 to 6 hours', 'more than 6 hours'), right = FALSE)

# Output results
  result_min <- count(df_min, 'gr_interval')
  result_max <- count(df_max, 'gr_interval')
  head(result_max)
  head(result_min)