| Title: | String Patterns and Statistical Differences Between Two Groups of Strings |
|---|---|
| Description: | Methods include converting series of event names to strings, finding common patterns in a group of strings, discovering "unique" patterns when comparing two groups of strings as well as the number and starting position of each pattern in each string, obtaining transition matrix, computing transition entropy, statistically comparing the difference between two groups of strings, and clustering string groups. Event names can be any action names or labels such as events in log files or areas of interest (AOIs) in eye tracking research. An R Shiny application is available on GitHub. |
| Authors: | Hui Tang [aut], Norbert J. Pienta [aut], Hui Tang [cre] (Tom) |
| Maintainer: | Hui Tang <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.5.1 |
| Built: | 2026-05-27 09:28:51 UTC |
| Source: | https://github.com/cran/GrpString |
Methods include converting series of event names to strings, finding common patterns in a group of strings, discovering "unique" patterns when comparing two groups of strings as well as the number and starting position of each pattern in each string, obtaining transition matrix, computing transition entropy, statistically comparing the difference between two groups of strings, and clustering string groups.
Event names can be any action names or labels such as events in log files or areas of interest (AOIs) in eye tracking research.
| Package: | GrpString |
| Type: | Package |
| Version: | 0.5.1 |
| Date: | 2026-02-23 |
| License: | GPL-2 |
Several functions come in two versions: one returns a data frame or a vector, and the other exports one or more .txt files to the current directory. The former is the simple version, while the latter can be considered a generalized or more complex version. The exporting versions exist because some data sets are large (e.g., many rows or columns), and because exported files make results easier to view and manage when more than one data set is produced. Examples of these function pairs are EveStr - EveString, CommonPatt - CommonPattern, and PatternInfo - UniPatterns.
In addition, to save users' effort, the function EveString takes an input file (a .txt or .csv file) instead of a data frame, because such input data are more conveniently stored in a file than in a data frame. We suggest that users copy the relevant input files (including eve1d.txt and eve1d.csv) to a different directory, because the function exports files to the directory where the input files are located.
Hui Tang, Norbert J. Pienta
Maintainer: Hui (Tom) Tang <[email protected]>
# Discover common patterns in a group of strings
strs.vec <- c("ABCDdefABCDa", "def123DC", "123aABCD", "ACD13", "AC1ABC", "3123fe")
CommonPatt(strs.vec, low = 30)
CommonPatt finds common patterns shared by a group of strings.
A common pattern is defined as a substring with the minimum length of three that occurs at least twice among a group of strings.
CommonPatt(strings.vec, low = 10)
| strings.vec | String vector. |
| low | Cutoff. It is the minimum percentage of the occurrence of patterns that the user specifies. The default value is 10. |
The argument 'low' ranges from 0 to 100 in percentage.
The function returns a data frame containing patterns, lengths and percentages of patterns.
row name - The initial order of substrings, which can be ignored.
Column 1 - Pattern: common pattern.
Column 2 - Freq_total: the overall frequency (times of occurrence) of each pattern.
Column 3 - Percent_total: the ratio of Freq_total to the number of original strings, in percent.
Column 4 - Length: the length (i.e., number of characters) of pattern.
Column 5 - Freq_str: similar to Freq_total; but each pattern is counted only once in a string even if the string contains that pattern multiple times.
Column 6 - Percent_str: similar to Percent_total; but each pattern is counted only once in a string if this string contains the pattern.
Data is sorted by Length, then Freq_total, in decreasing order.
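The difference between Freq_total and Freq_str can be illustrated with a small base-R sketch. This is not the CommonPatt implementation: it counts non-overlapping matches of one hand-picked pattern with `gregexpr`, and the variable names are hypothetical.

```r
# Illustrative only: base-R counting, not the GrpString implementation.
strs.vec <- c("ABCDdefABCDa", "def123DC", "123aABCD", "ACD13", "AC1ABC", "3123fe")
pattern <- "ABC"

# Start positions of every non-overlapping match in each string; -1 means no match
hits <- gregexpr(pattern, strs.vec, fixed = TRUE)
per_string <- vapply(hits, function(h) sum(h > 0), integer(1))

freq_total <- sum(per_string)      # every occurrence counts
freq_str <- sum(per_string > 0)    # each string counts at most once

# Both percentages are relative to the number of original strings
percent_total <- 100 * freq_total / length(strs.vec)
percent_str <- 100 * freq_str / length(strs.vec)
```

Here "ABC" occurs twice in the first string, so Freq_total exceeds Freq_str, and Percent_total can exceed 100 when a pattern repeats often within strings.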
1. H. Tang; E. Day; L. Kendhammer; J. N. Moore; S. A. Brown; N. J. Pienta. (2016). Eye movement patterns in solving science ordering problems. Journal of eye movement research, 9(3), 1-13.
2. J. J. Topczewski; A. M. Topczewski; H. Tang; L. Kendhammer; N. J. Pienta. (2017). NMR Spectra through the eyes of a student: eye tracking applied to NMR items. Journal of chemical education, 94(1), 29-37.
3. J. M. West; A. H. Haake; E. P. Rozanski; K. S. Karn. (2006). EyePatterns: Software for identifying patterns and similarities across fixation sequences. In Proceedings of the Symposium on Eye-tracking Research & Applications, ACM Press, New York, 149-154.
# Simple strings, non-default cutoff
strs.vec <- c("ABCDdefABCDa", "def123DC", "123aABCD", "ACD13", "AC1ABC", "3123fe")
CommonPatt(strs.vec, low = 30)
CommonPattern finds common patterns shared by a group of strings.
It converts patterns back to event names that are added to the common pattern table.
A common pattern is defined as a substring with the minimum length of three that occurs at least twice among a group of strings.
CommonPattern(strings.vec, low = 30, eveChar.df)
| strings.vec | String vector. |
| low | The lowest cutoff. It is the minimum percentage of the occurrence of patterns that the user specifies. The default value is 30. |
| eveChar.df | Data frame that stores the event name - character conversion key. |
The argument 'low' ranges from 0 to 100 in percentage.
A data frame that contains patterns, lengths, percentages of patterns, and converted event names.
row name - The initial order of substrings, which can be ignored.
Column 1 - Pattern: common pattern.
Column 2 - Freq_total: the overall frequency (times of occurrence) of each pattern.
Column 3 - Percent_total: the ratio of Freq_total to the number of original strings, in percent.
Column 4 - Length: the length (i.e., number of characters) of pattern.
Column 5 - Freq_str: similar to Freq_total; but each pattern is counted only once in a string even if the string contains that pattern multiple times.
Column 6 - Percent_str: similar to Percent_total; but each pattern is counted only once in a string if this string contains the pattern.
Column 7 - Event_name: sequence of event names converted back from the pattern string.
Data is sorted by Length, then Freq_total, in decreasing order.
1. H. Tang; E. Day; L. Kendhammer; J. N. Moore; S. A. Brown; N. J. Pienta. (2016). Eye movement patterns in solving science ordering problems. Journal of eye movement research, 9(3), 1-13.
2. J. J. Topczewski; A. M. Topczewski; H. Tang; L. Kendhammer; N. J. Pienta. (2017). NMR Spectra through the eyes of a student: eye tracking applied to NMR items. Journal of chemical education, 94(1), 29-37.
3. J. M. West; A. H. Haake; E. P. Rozanski; K. S. Karn. (2006). EyePatterns: Software for identifying patterns and similarities across fixation sequences. In Proceedings of the Symposium on Eye-tracking Research & Applications, ACM Press, New York, 149-154.
data(eventChar.df)
data(str1)
s0 <- str1[5:15]
CommonPattern(s0, low = 30, eveChar.df = eventChar.df)
DupRm removes successive duplicated characters in each string in a group.
DupRm(strings.vec)
| strings.vec | String vector. |
Returns a string vector with successive duplicates removed.
That is, each string in the returned vector is "collapsed".
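The "collapsing" described above can be sketched in base R with a back-referencing regular expression. The helper name `collapse_runs` is hypothetical, and this is an assumed equivalent rather than the code DupRm actually uses.

```r
# Hypothetical helper: replace each run of one repeated character with a single copy.
# "(.)\\1+" matches a character followed by one or more copies of itself.
collapse_runs <- function(strings.vec) {
  gsub("(.)\\1+", "\\1", strings.vec, perl = TRUE)
}

collapsed <- collapse_runs(c("000<<<<<DDDFFF", "aaBB111"))
print(collapsed)
```

Only successive duplicates are removed, so a character that reappears later in the string (after a different character) is kept.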
# Simple example
dup1 <- "000<<<<<DDDFFF333333qqqqqKKKKK33FFF"
dup3 <- "aaBB111^^~~~555667777000000!!!###$$$$$$&&&(((((***)))))@@@@@>>>>99"
dup13 <- c(dup1, dup3)
DupRm(dup13)
A data frame containing event names. There are 45 rows, and each row has 26 event names.
data(event1s.df)
A data frame with 45 observations or rows.
The event names are from an eye tracking study. Thus, each event name is actually an area of interest (AOI).
data(event1s.df)
A data frame where each element in column event (event name) corresponds to an element in column char (character), which can be a letter, digit, or a special character.
data(eventChar.df)
A data frame with 16 observations on the following 2 variables.
event: a character vector
char: a character vector
data(eventChar.df)
EveStr converts event names in a data frame to a string vector. In the data frame, every row has the same number of event names, and each row is converted to a string based on the conversion key. The function returns a string vector in which all converted strings have the same length.
EveStr(eveName.df, eveName.vec, char.vec)
| eveName.df | Data frame that stores event names to be converted. |
| eveName.vec | Event name vector in a conversion key. |
| char.vec | Character vector in a conversion key. |
The lengths of eveName.vec and char.vec are the same.
Each element (event name) in eveName.vec corresponds to an element (character) in char.vec.
An element in char.vec can be a letter, digit, or a special character.
The function returns a string vector.
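The conversion-key idea can be sketched in base R with `match`: each event name is looked up in the event name vector and replaced by the character at the same position. This is a minimal illustration under that assumption, not the EveStr source, and the column names `e1` to `e3` and the helper `row_to_string` are made up for the example.

```r
# Illustrative conversion of event names to strings, one string per row.
event.df <- data.frame(e1 = c("aoi_1", "aoi_2"),
                       e2 = c("aoi_1", "aoi_3"),
                       e3 = c("aoi_3", "aoi_5"),
                       stringsAsFactors = FALSE)
event.name.vec <- c("aoi_1", "aoi_2", "aoi_3", "aoi_4", "aoi_5")
label.vec <- c("a", "b", "c", "d", "e")

# Look up each event name in the key and paste the matching characters together
row_to_string <- function(events) {
  paste(label.vec[match(events, event.name.vec)], collapse = "")
}

converted <- unname(apply(event.df, 1, row_to_string))
print(converted)
```

Because every row holds three event names, every converted string has three characters.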
# small number of event names
event.df <- data.frame(c("aoi_1", "aoi_2"), c("aoi_1", "aoi_3"), c("aoi_3", "aoi_5"))
event.name.vec <- c("aoi_1", "aoi_2", "aoi_3", "aoi_4", "aoi_5")
label.vec <- c("a", "b", "c", "d", "e")
EveStr(event.df, event.name.vec, label.vec)

# more event names
data(event1s.df)
data(eventChar.df)
EveStr(event1s.df, eventChar.df$event, eventChar.df$char)
EveString converts event names in a data frame to a string vector.
In the data frame, each row, which can have a different number of event names, is converted to a string based on the conversion key. As a result, in the vector, converted strings may have different lengths.
EveString(eveName.file, eveName.vec, char.vec)
| eveName.file | File that stores event names to be converted. |
| eveName.vec | Vector of event names in a conversion key. |
| char.vec | Character vector in a conversion key. |
In general, it is inconvenient to work with data frames whose rows have different numbers of elements, so a text file is easier to use than a data frame for storing rows with different numbers of event names. This function therefore takes a .txt or .csv file (eveName.file) and handles the conversion to save users' effort.
The function returns a vector containing converted strings that generally have different lengths.
If not all event names are converted to characters, a warning message will be printed out.
eveName.file is the name of a file, so quotation marks are needed when a file name (and its directory) is passed directly to the function.
If the example is used, eveName.file is eve1d.txt, which is located in your R library. Users may copy eve1d.txt to a directory that is easy to find.
data(eventChar.df)
event1d <- paste(path.package("GrpString"), "/extdata/eve1d.txt", sep = "")
EveString(event1d, eventChar.df$event, eventChar.df$char)
The positions of the legend and p value in the histogram generated by function StrDif may not be ideal for every distribution of permuted differences of normalized Levenshtein distances. HistDif customizes the positions of the legend and p value in the histogram of the statistical difference between two groups of strings.
HistDif(dif.vec, obsDif, pvalue, o.x = 0.01, o.y = 0, p.x = 0.015, p.y = 0)
| dif.vec | Vector containing differences of normalized Levenshtein distances (LD) from the permutation test. |
| obsDif | The "observed" or original difference between between-group and within-group normalized LD. |
| pvalue | p value of the permutation test. |
| o.x | x coordinate of the legend in the histogram, default is 0.01. |
| o.y | y coordinate of the legend in the histogram, default is 0. |
| p.x | x coordinate of the p value in the histogram, default is 0.015. |
| p.y | y coordinate of the p value in the histogram, default is 0. |
The default values of o.y and p.y are 0. With these defaults, the actual y positions depend on the number of permutations (num_perm): the legend is placed above 0.2 * num_perm, and the p value below 0.2 * num_perm. If non-default values are supplied, they are used as absolute y coordinates.
# simple example, use the vectors of ld difference values obtained from StrDif
strs1.vec <- c("ABCDdefABCDa", "def123DC", "123aABCD", "ACD13", "AC1ABC", "3123fe")
strs2.vec <- c("xYZdkfAxDa", "ef1563xy", "BC9Dzy35X", "AkeC1fxz", "65CyAdC", "Dfy3f69k")
ld.dif.vec <- StrDif(strs1.vec, strs2.vec, num_perm = 500, p.x = 0.025)
HistDif(dif.vec = ld.dif.vec, obsDif = 0.00751, pvalue = 0.35600,
        o.x = 0.025, p.x = 0.040, p.y = 75)
Patterns whose occurrence is at least 20 percent of the number of strings in string group 1. The vector can be obtained from one of the files exported by CommonPattern(str1).
data(p1_20up)
The format is: chr [1:32] "212" "202" "BAB" "D0D" "F0F" "020" "B0B" "010" "404" "C0C" ...
data(p1_20up)
Patterns whose occurrence is at least 25 percent of the number of strings in string group 2. The vector can be obtained from one of the files exported by CommonPattern(str2).
data(p2_25up)
The format is: chr [1:32] "0D0D" "0E0E" "E0E0" "D0D" "E0E" "F0F" "B0B" "0C0" "0D0" ...
data(p2_25up)
PatternInfo finds, for each string, the starting position of the first or last occurrence of each pattern, as well as the number of patterns in the string.
PatternInfo(patterns, strings, rev = FALSE)
| patterns | Pattern vector. |
| strings | String vector. |
| rev | Whether to return the starting positions of the patterns that occur last (TRUE) instead of first (FALSE) in the strings. Default is FALSE (first). |
Returns a data frame, which contains the length of each string, and the starting position of each pattern in each string.
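The notions of "first" and "last" starting position can be sketched with base-R `regexpr` and `gregexpr`. The helpers `first_pos` and `last_pos` are hypothetical, and the value -1 for "pattern not found" is the base-R convention, which may differ from what PatternInfo reports.

```r
# Illustrative lookup of first and last start positions of a pattern.
strs.vec <- c("ABCDdefABCDa", "def123DC", "123aABCD")

# Start of the first occurrence in each string (-1 if absent)
first_pos <- function(pattern, strings) {
  as.integer(regexpr(pattern, strings, fixed = TRUE))
}

# Start of the last occurrence in each string (-1 if absent)
last_pos <- function(pattern, strings) {
  hits <- gregexpr(pattern, strings, fixed = TRUE)
  vapply(hits, function(h) as.integer(max(h)), integer(1))
}

info <- data.frame(length = nchar(strs.vec),
                   ABC_first = first_pos("ABC", strs.vec),
                   ABC_last = last_pos("ABC", strs.vec))
print(info)
```

The first string contains "ABC" twice, so its first and last starting positions differ, while the third string contains it once and both positions agree.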
# simple strings and patterns
strs.vec <- c("ABCDdefABCDa", "def123DC", "123aABCD", "ACD13", "AC1ABC", "3123fe")
patts <- c("ABC", "123")
PatternInfo(patts, strs.vec)

# simple strings and patterns, starting position of last pattern
strs.vec <- c("ABCDdefABCDa", "def123DC", "123aABCD", "ACD13", "AC1ABC", "3123fe")
patts <- c("ABC", "123")
PatternInfo(patts, strs.vec, rev = TRUE)
A vector containing 45 strings that have different lengths. It can also be obtained from the file exported by the example in function EveString.
data(str1)
The format is: chr [1:45] "D02F0E20DEDC0C30BDC0E45G050A0B5050A06BG0BA5607BA" ...
data(str1)
A vector containing 29 strings that have different lengths.
data(str2)
The format is: chr [1:29] "G21A1C14C2D0D21D2123201D23D21234320431212412421AB3EGEGE0E4G4B5G6A" ...
data(str2)
StrDif tests whether the difference between two groups of strings is statistically significant or not. The difference is based on normalized Levenshtein distances between strings. A permutation test is used as the statistical method.
StrDif(grp1_string, grp2_string, num_perm = 1000, o.x = 0.01, o.y = 0, p.x = 0.015, p.y = 0)
| grp1_string | String group (vector) 1. |
| grp2_string | String group (vector) 2. |
| num_perm | Number of permutations. The default is 1000. |
| o.x | x coordinate of the legend in the histogram, default is 0.01. |
| o.y | y coordinate of the legend in the histogram, default is 0. |
| p.x | x coordinate of the p value in the histogram, default is 0.015. |
| p.y | y coordinate of the p value in the histogram, default is 0. |
The default values of o.y and p.y are 0. With these defaults, the actual y positions depend on num_perm: the legend is placed above 0.2 * num_perm, and the p value below 0.2 * num_perm. If non-default values are supplied, they are used as absolute y coordinates.
The function generates a histogram that demonstrates the distribution of the differences of LDs, the original difference, and the p value.
The function also returns a vector containing differences of normalized Levenshtein distances (LD). The total number of differences is num_perm (number of permutations).
Differences are calculated by subtracting within-group LD from between-group LD. They range from -1 to 1. The "observed" difference is the difference from the original data set.
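The general shape of such a permutation test can be sketched in base R. Everything specific here is an assumption made for illustration and not taken from StrDif: the normalization (Levenshtein distance divided by the length of the longer string), the use of mean pairwise distances, the one-sided p value, and the helper name `ld_difference`.

```r
# Sketch only: the normalization and test statistic are assumptions,
# not the code used by StrDif.
grp1 <- c("ABCDdefABCDa", "def123DC", "123aABCD", "ACD13", "AC1ABC", "3123fe")
grp2 <- c("xYZdkfAxDa", "ef1563xy", "BC9Dzy35X", "AkeC1fxz", "65CyAdC", "Dfy3f69k")
all_strings <- c(grp1, grp2)

# Pairwise Levenshtein distances, each divided by the longer string's length
len <- nchar(all_strings)
norm_ld <- utils::adist(all_strings) / outer(len, len, pmax)

# Mean between-group LD minus mean within-group LD for a given labelling
ld_difference <- function(d, in_grp1) {
  same_group <- outer(in_grp1, in_grp1, "==")
  pairs <- upper.tri(d)                # count each pair of strings once
  mean(d[pairs & !same_group]) - mean(d[pairs & same_group])
}

in_grp1 <- rep(c(TRUE, FALSE), times = c(length(grp1), length(grp2)))
obs_dif <- ld_difference(norm_ld, in_grp1)

set.seed(1)                            # reproducible shuffles
perm_dif <- replicate(200, ld_difference(norm_ld, sample(in_grp1)))
p_value <- mean(perm_dif >= obs_dif)   # one-sided, an assumption
```

Because each normalized distance lies between 0 and 1, every difference in `perm_dif` lies between -1 and 1, consistent with the range stated above. Shuffling the group labels while keeping the distance matrix fixed avoids recomputing distances for every permutation.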
1. Because the number of permutations is usually large (the default is 1000), the vector returned by the function has many elements, so it is better to store the result in a variable than to print it directly. See the examples.
2. The positions of the legend and p value in the histogram generated by function StrDif may not be ideal for every distribution of permuted differences of normalized Levenshtein distances. Thus, this package includes another function, HistDif, to customize the positions of the legend and p value in the histogram.
3. Running this function can take a relatively long time (from seconds to minutes, depending on the number and lengths of the strings as well as computer performance).
4. Acknowledgement: The first version of this function was developed with significant help from Dr. Rhonda DeCook in the Department of Statistics and Actuarial Science at the University of Iowa.
1. H. Tang; J. J. Topczewski; A. M. Topczewski; N. J. Pienta. Permutation Test for Groups of Scanpaths Using Normalized Levenshtein Distances and Application in NMR Questions. In Proceedings of the Symposium on Eye Tracking Research and Applications, Santa Barbara, CA, March 28-30, 2012; ACM Press: New York; pp 169-172.
2. M. Feusner; B. Lukoff. (2008). Testing for statistically significant differences between groups of scan patterns. In Proceedings of the Symposium on Eye-tracking Research & Applications, ACM Press, New York, 43-46.
# simple strings, non-default permutation number and p-value position
strs1.vec <- c("ABCDdefABCDa", "def123DC", "123aABCD", "ACD13", "AC1ABC", "3123fe")
strs2.vec <- c("xYZdkfAxDa", "ef1563xy", "BC9Dzy35X", "AkeC1fxz", "65CyAdC", "Dfy3f69k")
ld.dif.vec <- StrDif(strs1.vec, strs2.vec, num_perm = 500, p.x = 0.025)

# longer strings
data(str1)
data(str2)
s1 <- str1[1:6]
s2 <- str2[1:6]
ld.dif12.vec <- StrDif(s1, s2, num_perm = 500)
StrHclust discovers clusters of the strings in a group.
StrHclust(strings.vec, nclust = 2)
| strings.vec | String vector. |
| nclust | Number of clusters. Default is 2. |
Returns a data frame with the specific cluster assigned to each string.
A hierarchical dendrogram is also exported.
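Hierarchical clustering of strings can be sketched in base R by building a distance matrix and passing it to `hclust`. The choice of distance (Levenshtein distance divided by the longer string's length) and of the default complete linkage are assumptions for this illustration, since the manual does not state what StrHclust uses internally.

```r
# Sketch only: distance measure and linkage are assumptions, not StrHclust internals.
strs3.vec <- c("ABCDdefABCDa", "AC3aABCD", "ACD1AB3", "xYZfgAxZY", "gf56xZYx", "AkfxzYZg")

# Normalized pairwise Levenshtein distances
len <- nchar(strs3.vec)
d_norm <- utils::adist(strs3.vec) / outer(len, len, pmax)

# Build the tree and cut it into two clusters
tree <- stats::hclust(stats::as.dist(d_norm))
clusters <- data.frame(string = strs3.vec,
                       cluster = stats::cutree(tree, k = 2),
                       stringsAsFactors = FALSE)
print(clusters)
# plot(tree, labels = strs3.vec) would draw the dendrogram
```

The resulting data frame has one row per string with its assigned cluster, mirroring the kind of output described above.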
# Simple strings
strs3.vec <- c("ABCDdefABCDa", "AC3aABCD", "ACD1AB3", "xYZfgAxZY", "gf56xZYx", "AkfxzYZg")
StrHclust(strs3.vec)
StrKclust discovers clusters of the strings in a group.
StrKclust(strings.vec, nclust = 2, nstart = 1)
| strings.vec | String vector. |
| nclust | Number of clusters. Default is 2. |
| nstart | Number of random data sets chosen to start. Default is 1. |
Returns a data frame with the specific cluster assigned to each string.
A cluster plot is also exported.
# Simple strings
strs3.vec <- c("ABCDdefABCDa", "AC3aABCD", "ACD1AB3", "xYZfgAxZY", "gf56xZYx", "AkfxzYZg")
StrKclust(strs3.vec)
TransEntro computes the overall transition entropy of all the strings in a group.
TransEntro(strings.vec)
| strings.vec | String vector. |
Entropy is calculated using the Shannon entropy formula: -sum(freqs * log2(freqs)). Here, freqs are transition frequencies, which are the values in the normalized transition matrix returned by function TransMx in this package. The formula is equivalent to the function entropy.empirical in the entropy package when unit is set to log2.
Returns a single number.
Strings with fewer than 2 characters are not included in the computation of entropy.
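The formula above can be worked through on a tiny input in base R. This sketch assumes that the transition frequencies are counts divided by the total number of transitions across all strings. It is an illustration of the formula, not the TransEntro source.

```r
# Illustrative overall transition entropy for a small group of strings.
strings.vec <- c("ABAB", "BA", "A")

# Drop strings with fewer than 2 characters: they contain no transition
usable <- strings.vec[nchar(strings.vec) >= 2]

# All adjacent two-character substrings, pooled over the group
transitions <- unlist(lapply(usable, function(s) {
  substring(s, 1:(nchar(s) - 1), 2:nchar(s))
}))

# Relative frequency of each observed transition
freqs <- as.numeric(table(transitions)) / length(transitions)

# Shannon entropy in bits
entropy <- -sum(freqs * log2(freqs))
print(entropy)
```

The pooled transitions are AB, BA, AB, BA, so the two transition types are equally frequent and the entropy is 1 bit. Only observed transitions enter the sum, which sidesteps the undefined term 0 * log2(0).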
I. Hooge; G. Camps. (2013) Scan path entropy and arrow plots: capturing scanning behavior of multiple observers. Frontiers in Psychology.
# simple strings
stra.vec <- c("ABCDdefABCDa", "def123DC", "A", "123aABCD", "ACD13", "AC1ABC", "3123fe")
TransEntro(stra.vec)
TransEntropy computes the transition entropy of each of the strings in a group.
TransEntropy(strings.vec)
| strings.vec | String vector. |
Entropy is calculated using the Shannon entropy formula: -sum(freqs * log2(freqs)). Here, freqs are transition frequencies, which are the values in the normalized transition matrix returned by function TransMx in this package. The formula is equivalent to the function entropy.empirical in the entropy package when unit is set to log2.
Returns a number vector.
Strings with fewer than 2 characters are not included in the computation of entropy.
I. Hooge; G. Camps. (2013) Scan path entropy and arrow plots: capturing scanning behavior of multiple observers. Frontiers in Psychology.
# default values
stra.vec <- c("ABCDdefABCDa", "def123DC", "A", "123aABCD", "ACD13", "AC1ABC", "3123fe")
TransEntropy(stra.vec)
TransInfo discovers transitions of two adjacent characters in strings.
A transition is defined as a substring (in the forward order) with length of 2 characters. It can be considered as a special common pattern (length of 2).
TransInfo(strings.vec, type1 = "letters", type2 = "digits")
| strings.vec | String vector. |
| type1 | The first type of transition. Default value is "letters". |
| type2 | The second type of transition. Default value is "digits". |
The function returns a data frame, which contains the numbers of type1 transition, type2 transition, and transitions belonging to neither type1 nor type2.
Strings with fewer than 2 characters are not included due to the definition of transition.
1. H. Tang; E. Day; L. Kendhammer; J. N. Moore; S. A. Brown; N. J. Pienta. (2016) Eye movement patterns in solving science ordering problems. Journal of eye movement research, 9(3), 1-13.
2. J. J. Topczewski; A. M. Topczewski; H. Tang; L. Kendhammer; N. J. Pienta. (2017) NMR Spectra through the eyes of a student: eye tracking applied to NMR items. Journal of chemical education, 94(1), 29-37.
# default values
strs.vec <- c("ABCDdefABCDa", "def123DC", "123aABCD", "ACD13", "AC1ABC", "3123fe")
TransInfo(strs.vec)

# non-default values
str1.vec <- c("ABCABEF", "CDCDAB")
TransInfo(str1.vec, type1 = "AB", type2 = "CD")
TransMx computes the transition matrix of a string vector, together with related information.
A transition is defined as a substring (in the forward order) with length of 2 characters. It can be considered as a special common pattern (length of 2).
TransMx(strings.vec)
| strings.vec | String vector. If a string has fewer than 2 characters, that string will be ignored. |
The function returns a list, which contains the transition matrix, the normalized matrix, and the sorted numbers of transitions.
Strings with fewer than 2 characters are not included due to the definition of transition.
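A transition matrix of this kind can be sketched in base R by tabulating each character against its successor. The normalization by the total number of transitions is an assumption for illustration, and the code is not the TransMx implementation.

```r
# Illustrative transition matrix: rows are "from" characters, columns are "to".
strings.vec <- c("ABAB", "BA", "A")
usable <- strings.vec[nchar(strings.vec) >= 2]   # ignore strings without transitions

chars <- strsplit(usable, "")
from <- unlist(lapply(chars, function(ch) ch[-length(ch)]))  # all but the last character
to <- unlist(lapply(chars, function(ch) ch[-1]))             # all but the first character

# Use a shared set of levels so the matrix is square
states <- sort(unique(c(from, to)))
mx <- table(factor(from, levels = states), factor(to, levels = states),
            dnn = c("from", "to"))

# Normalized matrix: each cell divided by the total number of transitions
mx_norm <- mx / sum(mx)
print(mx)
print(mx_norm)
```

For this input the transitions A to B and B to A each occur twice, and no character follows itself, so the diagonal is zero.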
# simple strings
strs.vec <- c("ABCDdefABCDa", "def123DC", "123aABCD", "ACD13", "AC1ABC", "3123fe")
TransMx(strs.vec)
UniPatterns discovers "unique" patterns that are in one group of strings but not the other.
UniPatterns(grp1_pattern, grp2_pattern, grp1_string, grp2_string)
| grp1_pattern | Patterns shared by a certain percent of strings in string group 1. |
| grp2_pattern | Patterns shared by a certain percent of strings in string group 2. |
| grp1_string | String group 1. |
| grp2_string | String group 2. |
A (common) pattern is defined as a substring with the minimum length of three that occurs at least twice among a group of strings.
A unique pattern is a pattern that appears in only one of the two groups of strings.
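One way to read this definition is that a pattern from one group is "unique" when it does not occur as a substring of any string in the other group. The base-R sketch below implements that reading. It is an assumption about the logic, not the UniPatterns source, and the helper `occurs_in` and the shortened example data are hypothetical.

```r
# Sketch only: one possible reading of "unique" patterns between two groups.
grp1_string <- c("ABCDABCDa", "123aABCD", "AC1ABC")
grp2_string <- c("xYZdef9", "def123DC", "3123fe")
grp1_pattern <- c("ABC", "123")
grp2_pattern <- c("def", "123")

# TRUE if the pattern is a substring of at least one of the strings
occurs_in <- function(pattern, strings) {
  any(grepl(pattern, strings, fixed = TRUE))
}

# Keep the patterns of each group that never appear in the other group's strings
unique1 <- grp1_pattern[!vapply(grp1_pattern, occurs_in, logical(1), strings = grp2_string)]
unique2 <- grp2_pattern[!vapply(grp2_pattern, occurs_in, logical(1), strings = grp1_string)]

print(unique1)
print(unique2)
```

The pattern "123" appears in strings of both groups, so it is dropped from both results, while "ABC" and "def" each occur in only one group.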
The function exports a data frame that lists unique patterns: column 1 for string group 1; column 2 for string group 2.
See also: PatternInfo, CommonPatt, CommonPattern
data(str1)
data(str2)
data(p1_20up)
data(p2_25up)
UniPatterns(p1_20up, p2_25up, str1, str2)