Thursday, 11 March 2010

Creating Network Maps

There are many tools that can be used to scan systems and build a network map. The best known of these is nmap. Beyond the section in the Firewall chapter of this book, there are many excellent sources of information for the auditor or security professional wanting to discover more about this tool.


Though nmap has been ported to Windows, it works best under Linux or UNIX, as many of the options available within nmap are “broken” by the Microsoft network stack.

We covered using nmap for individual scans in an earlier chapter, “Testing the Firewall”. In this section we look at how to automate the response and make this tool useful for reporting.

The prime limitation of nmap is its reporting capability. Nmap does provide output in a “grep’able” format, but there are far more effective tools for querying the data. PBNJ (a package that includes ScanPBNJ and OutputPBNJ) can import nmap scan results from nmap’s “-oX” XML output format and provides the capability to query this data. The program is written in Perl and provides a means to quickly identify changes to the systems and network.

ScanPBNJ can be used to scan the network using nmap directly. Alternatively, scanning with nmap and then importing the output into ScanPBNJ requires the nmap XML output format (-oX); ScanPBNJ with the “-x” option can import the results of the nmap XML report.


PBNJ is a suite of tools that provides the capacity to monitor change across a network over time. It can save nmap results into a database and check for changes on the target host(s), saving the details of the services running on these hosts as well as the service state. PBNJ parses the data from an nmap scan and stores the results in the database, using nmap as its scanning engine.

The benefits of PBNJ include:

· the ability to configure automated internal and external scans;

· a configurable and flexible querying language and alerting system;

· the ability to parse nmap XML output files;

· the ability to access nmap output using a database (SQLite, MySQL or Postgres);

· the ability to use distributed scanning with separate consoles and scan engines; and

· support for Linux, BSD and Windows (Linux or UNIX are recommended over Windows in this instance).

ScanPBNJ default scan options

By default, ScanPBNJ runs an nmap scan using the command options “nmap -vv -O -P0 -sS -p 1-1025”. This is an extremely verbose SYN scan of TCP ports 1 through 1025, with operating system identification enabled and host pinging disabled.

It is possible to override the default options in ScanPBNJ using the “-a” switch. For instance, to scan all TCP ports on the host, the following command could be used:

ScanPBNJ -a "-A -sS -P0 -p 1-65535"

This command uses the SYN scan option (-sS), does not ping the host (-P0), and enables version scanning and operating system detection (-A). Any of the standard nmap switches and scan types may be used.


The ability to query the ScanPBNJ results is provided by OutputPBNJ. OutputPBNJ uses a YAML query configuration file to run queries against the information collected by ScanPBNJ and can display the results of the scans in a variety of formats (such as CSV, tab-delimited and HTML).

A number of predefined queries are included with OutputPBNJ and may be used to query the nmap results. The configuration file “query.yaml” contains the default queries defined on the system.

By default, only a small number of queries are provided. It is possible both to modify the existing default queries and to query the database directly. An ODBC connection to the database could also be used to load data from the database into another tool.

Tuesday, 9 March 2010

A primer for using the R statistical package

We can use R for complicated statistical analyses or multiple calculations on a large dataset. It has powerful plotting, graphing, and data visualisation functions of professional quality. There are many freely available packages and libraries, so the user doesn’t have to waste time recreating the same functions. It is a fully programmable language and is capable of connecting to relational database systems. In short, it is a versatile tool for data analysis and data mining, and it is very suitable for data assurance and auditing purposes. This tutorial and compiled manual will give you the ‘fishing rod’, so you can learn R at your own pace without going ‘Arrgh’ and pulling your hair out!

Installing and starting R

Go to this webpage and save the install file:

  1. Execute the downloaded file to install R.
  2. Use the following document to get started with R:

o Maindonald, JH (2004) Using R for Data Analysis and Graphics: Introduction, Code and Commentary. Centre for Bioinformation Science, Australian National University.

  3. Read and go through the tutorials in sections 1.1 and 1.2.
  4. Load the data sets which accompany the above reference by doing the following:

o Click File -> Load Workspace…

o Choose the file usingR.Rdata, in the same folder as this document.

o Now you don’t need to load the data every time the above reference asks you to.

2 Keeping notes on what you’ve typed

A good habit in using R is to keep notes on what you’ve typed and done, and to copy and paste the results into the notes. You can also make notes next to the R code to help you remember what it means. Write notes starting with ‘#’; anything after the ‘#’ on the same line is ignored by the R engine. For example:

# computing the square root of 64

x <- 64 # assigning 64 to x

sqrt(x)

# [1] 8

3 Getting started and getting help

o Read and go through the tutorials in section 1.3 to 1.6. This will teach you:

o How to start R in Windows

o A simple text editing interface to write R code in

o Create a simple graph

o Using the in-built documentation systems. You’ll learn about the following help commands:

§ help

§ apropos

§ You can also prefix a function name with ‘?’ to get help, e.g.:

· ?plot

· ?mean

o Note: There are practical examples at the bottom of each R documentation page. These examples are very good for learning how to use the functions.

o Searchable mail archive of questions and answers:

§ Go to the CRAN webpage:

§ Click ‘Search’ on the manual bar on the left hand side of the page.

§ Click ‘Searchable mail archives’

§ Here you can search for the topics that you want more information about. You can also sign yourself up to the mailing list and ask your own questions. However, before you post a question, it is necessary to search the mail archive to check that the question hasn’t already been answered in the past. Alternatively, a Google search will serve the same purpose.

4 Quitting the R software

o Typing q() will quit the software.

o You can choose to save the workspace. This will let you come back later with the data stored in R still present. However, it is extremely risky to store any important data in the R workspace without saving a separate copy of the data elsewhere as a backup.

5 Loading existing R codes

If you want to load a text document with written R code, you can type source("windows_path_to_your_r_scripts.R") at the R command prompt to load the file. Typing history() will give you a list of all the commands you’ve typed into the R command prompt recently.

6 Using R as a fancy calculator

o Have a quick look at section 2.1.

o Make sure you know what these functions do:

o * (multiply), / (divide), ^ (power of), %% (remainder or modulo)

o summary(hills)

o the pairs(hills) command; this graph is a bit complicated, but it surely gives you a taste of what R can do!

o For a comprehensive list of operators see this reference

o Emmanuel Paradis, R for Beginners Section 3.5.3, page 25.

o For a list of useful functions to use, see the above reference, Section 3.5.7, page 31.
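As a quick check of these operators at the R prompt, here is a minimal sketch (the values are arbitrary; %/% is the related integer-division operator):

```r
a <- 7
b <- 2

a * b    # multiplication: 14
a / b    # division: 3.5
a ^ b    # power: 49
a %% b   # remainder (modulo): 1
a %/% b  # integer division: 3
```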

7 Assigning constant like algebra

o You can assign names to numbers; this is useful since it will help you remember them and recall them quickly. The following will assign the number 1 to the symbol x, and the number 2 to y. The arrow ‘<-’ performs assignment in R. Try typing the following into R:

x <- 1

2 -> y

x + y

o Read: Emmanuel Paradis, R for Beginners, section 2.2, 2.3. You’ll learn more about assigning variables, removing variables, removing objects, and how to get help on R.

8 Different variable names

o Instead of using just x and y, you can name the variables any way you like, using upper and lower case letters, the full stop ‘.’, and the underscore ‘_’. The variable must start with an alphabetical letter, and cannot start with a number or underscore. A good practice is to give them meaningful names rather than names like p1, p2 etc. Here are a few examples:

§ My_favourite_number <- 42

§ <- 298

o You can also assign text to a variable, such as:

§ <- "fine thanks!"

9 Simple plotting

  • We want to plot these five points on an X-Y coordinate graph: (1,5), (2,6), (3,7), (4,8), (5,9). To do this we need to assign the x-values 1 to 5 and the corresponding y-values 5 to 9. First we assign the x-values as a list of numbers, then repeat for the y-values. Enter the following:

x <- c( 1,2,3,4,5)

y <- c( 5,6,7,8,9)

plot ( x, y)

  • This will give us a simple scatter plot of the five points.

  • You can copy & paste the plot at any time by right-clicking on the graph and choosing ‘Copy as bitmap’.

10 Simple correlation study

  • We may use the ‘plot’ function to see how two sets of variables correlate with each other. We will use Longley’s economic regression data as an example.
  • We expect that the GNP and the employment figures should be highly correlated. For example:

cor.test(longley$GNP, longley$Employed)

  • The correlation value is 0.9835516, with p-value < 0.001.
  • Plotting the data

plot(longley$GNP, longley$Employed, main="GNP vs. Employment rate")

· We also want to fit a trend line and add it to the plot. First we compute the trend line and save it into the variable z:

z <- lm( longley$Employed ~ longley$GNP)

· Now we draw the line onto the plot:

abline(z)

· To see a summary of the linear model, type summary(z):

Call:
lm(formula = longley$Employed ~ longley$GNP)

Residuals:
      Min        1Q    Median        3Q       Max
-0.779583 -0.554401 -0.009444  0.343610  1.445943

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 51.843590   0.681372   76.09  < 2e-16 ***
longley$GNP  0.034752   0.001706   20.37 8.36e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6566 on 14 degrees of freedom
Multiple R-Squared: 0.9674, Adjusted R-squared: 0.965
F-statistic: 415.1 on 1 and 14 DF, p-value: 8.363e-12
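The same model can also be fitted with the data= interface, which makes it easy to pull individual numbers out of the fitted object; a short sketch:

```r
# Equivalent to lm(longley$Employed ~ longley$GNP)
z <- lm(Employed ~ GNP, data = longley)

coef(z)                 # intercept ~ 51.84, slope ~ 0.0348
summary(z)$r.squared    # ~ 0.9674, the Multiple R-Squared above

plot(Employed ~ GNP, data = longley)
abline(z)               # adds the fitted trend line to the open plot
```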

11 Data with R

· Using the appropriate data structure is like using a well designed Excel spreadsheet with the rows and columns clearly arranged and organised: it allows you to efficiently store and retrieve data and make ‘sense’ out of it. A few data structures are essential to data analysis: arrays, matrices, multidimensional arrays, lists, and data frames.

· Use the following reference to learn how to use these data structures:

· Wand, M. (2004) Fundamentals of R. A “Hands-On” Tutorial, Department of Statistics, University of New South Wales and [Last accessed: 22-02-08]

· Read ‘Section 4. Data Structures’ of the above material. It is important that you type or copy and paste them into the R command prompt to learn how it works, since this one is a hands-on tutorial.

· After having some fun with the hands-on tutorial, it is important to read more details behind the data structure. Emmanuel Paradis, R for Beginners, Section 3 on how to manage data in R.

· Note that there are several ways to access a ‘list’ in R.

· Logical operators

· One useful function to know is the command which. It gives you the indices of the items that you selected, for example:

x <- c( 1,3,5,7, 9, 8,6,4,2,26) # The first half of the list are odd numbers

> x

[1] 1 3 5 7 9 8 6 4 2 26

> x[ x %% 2 == 0 ]

[1] 8 6 4 2 26

> which( x %% 2 == 0 )

[1] 6 7 8 9 10
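The same logical-indexing idea works with any condition, not just %% 2; for example:

```r
x <- c(1, 3, 5, 7, 9, 8, 6, 4, 2, 26)

x[x > 5]                 # the values greater than 5: 7 9 8 6 26
which(x > 5)             # their positions: 4 5 6 7 10
x[which(x %% 2 == 0)]    # same result as x[x %% 2 == 0]
```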

12 Inputting and Outputting data

· Wand, M. (2004) Fundamentals of R. A “Hands-On” Tutorial, Department of Statistics, University of New South Wales and [Last accessed: 22-02-08]

· Read ‘Section 5. Input and Output’ of the above material. This will teach you how to read and save data. Note: after you have written something to a file using sink("my_file.txt"), you must close the file using sink(). This will direct all R command prompt output back to the screen rather than the file. Otherwise, all your future output will be saved in the file and nothing will be shown on the screen.

  • Other important I/O functions includes:
    • load
    • save
    • write.table
    • cat
    • print
    • paste
    • Use the R help functions (e.g. ?load) to read up on how they can be used.
  • Reading user input
    • You may use the command readline() to read in user input. For example:
    • > your_fav_color <- readline ( "What is your favourite color?")
    • What is your favourite color? red
    • > your_fav_color
    • [1] "red"
    • Use ?readline in the R command prompt to learn more about it if you’re interested.

13 What graphs can I draw with R?

13.1 Histogram

  • Plotting a histogram of the areas of the world’s major landmasses (the built-in islands data set):

hist(islands)

  • Plotting a histogram of 10,000 randomly generated numbers following the normal distribution with mean = 0 and standard deviation = 1:

my_normal_distribution <- rnorm (10000, 0, 1)

hist(my_normal_distribution)

13.2 Barplot

  • Barplot of the death rates in Virginia (1940), using the built-in VADeaths data set:

barplot(VADeaths)

This is what the VADeaths data looks like in table format. To see it, type VADeaths at the R command prompt:

Rural Male Rural Female Urban Male Urban Female

50-54 11.7 8.7 15.4 8.4

55-59 18.1 11.7 24.3 13.6

60-64 26.9 20.3 37.0 19.3

65-69 41.0 30.9 54.6 35.1

70-74 66.0 54.3 71.1 50.0
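Since VADeaths is a numeric matrix, it can also be summarised directly; a quick sketch:

```r
# Column means of the VADeaths matrix: the average death rate
# (per 1000) across the five age bands for each population group
colMeans(VADeaths)

# Urban males have the highest average rate
which.max(colMeans(VADeaths))
```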

13.3 Box-plot

A box-plot is used to visualise data without knowing beforehand what the distribution of the data looks like. Using the boxplot, we can look at the spread of the data very easily. The boxplot shows us the middle-ranked value, the median, as well as the quartiles of the data. For more detailed explanations, have a look at the Wikipedia page on box plots.

  • For example, a biologist wants to find out which brand of insect spray is the most effective, and performed twelve repeated experiments for each type of insect spray.

boxplot(count ~ spray, data = InsectSprays, xlab="Type of spray", ylab="Insect Count", col = "lightgray", main= "Effectiveness of different types of Insect Sprays")





The vertical axis represents the insect count, and the horizontal axis represents the type of spray used, labelled A to F. For each type of spray you will see a grey box with two lines that look like whiskers. The thick black line represents the median, and the quartiles are marked on the plot for spray type F. The circle dots represent outliers. You can see from the data that spray ‘C’ has the lowest median insect count and therefore seems to be the most effective. Now we can use this knowledge to plan our next experiment or statistical tests.

In the R code, the parameter ‘main’ specifies the title for the graph, ‘xlab’ and ‘ylab’ set the horizontal and vertical axis names respectively, and ‘col’ adjusts the colour of the bars. It is good practice to give each plot a title and label the axes while you’re working, since it is very easy to get them confused. Copy and paste the graph into an electronic journal as the project progresses to help you recall what you were doing later on.

  • This is how the InsectSprays data looks in R:

# The ‘typeof’ command gives you the type of data structure used to store the raw data.

> typeof(InsectSprays)

[1] "list"

# This gives you all the different counts for the insect sprays; the type of spray used is stored separately in the list.

> InsectSprays$count

[1] 10 7 20 14 14 12 10 23 17 20 14 13 11 17 21 11 16 14 17 17 19 21 7 13 0

[26] 1 7 2 3 1 2 1 3 0 1 4 3 5 12 6 4 3 5 5 5 5 2 4 3 5

[51] 3 5 3 6 1 1 3 2 6 4 11 9 15 22 15 16 13 10 26 26 24 13

We know that the 26th value in the list is 1, as indicated by the [26] symbol at the start of the second row of data. Similarly, the 51st value in the list is 3.

# This gives you the type of spray used. Each label corresponds to the same position in the list of counts above.

> InsectSprays$spray

[1] A A A A A A A A A A A A B B B B B B B B B B B B C C C C C C C C C C C C D D

[39] D D D D D D D D D D E E E E E E E E E E E E F F F F F F F F F F F F

Levels: A B C D E F
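The visual impression from the box-plot can be checked numerically; a quick sketch using tapply:

```r
# Median insect count for each spray type
med <- tapply(InsectSprays$count, InsectSprays$spray, median)
med

names(which.min(med))   # spray "C" has the lowest median count
```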

13.4 Plotting data in matrices

Plotting the growth in number of telephone in different countries (in thousands):

  • The function matplot allows you to plot the columns of one matrix against the columns of another.
matplot(rownames(WorldPhones), WorldPhones, pch=rep(21,7), type = "b", log = "y", xlab = "Year", ylab = "Number of telephones (1000's)")



· The following command allows you to draw the legend:

legend(1951.5, 80000, colnames(WorldPhones), col = 1:6, lty = 1:5,
       pch = rep(21, 7))

· The first two numbers represent the x and y co-ordinates for the position of the legend.

· The next input specifies the name for each line.

· col specifies the colour of each line.

· lty specifies the line type of each line (e.g. dotted, continuous).

· pch specifies the type of ‘dots’ used for each line; the rep command repeats the number 21 seven times.

· title allows you to specify the heading in another way:

title(main = "World phones data: log scale for response")


  • To see what the underlying data looks like, type WorldPhones at the R command prompt.
13.5 Locating positions in a graph

Sometimes you want to put the legend or some text at a specific position on the graph. For this you need the locator function. To try it out, first draw a plot, and then call locator(). To locate several points, mouse over the dot in the middle and click, then the dot in the top right corner of the graph and click, and then a random position and click. To finish the process, click the ‘Stop’ menu in the top left corner of the R window (where the ‘File’ drop-down menu usually is) and choose ‘Stop locator’. Now use the following commands:

plot ( c(2,3,4), c(10,15,20))

locator() # Click the three spots

# press stop, and then the locator will return something like the following:

$x
[1] 3.005612 3.993211 2.324766

$y
[1] 15.06202 20.00306 18.45253

  • Now you can put text at the corresponding position that you like:

text(2.324766, 18.45253, "my random spot")

  • This is what the end result looks like.


13.6 Printing graphs into PDF files

  • Before you can print graphs into a PDF file, you must first open the file. For example:

pdf("C:/Documents and Settings/All Users/Desktop/my_first_graph.pdf")

  • This will save the graph onto your Desktop. It is best to name and save the file directly into the folder for your project.

  • Note: in R, you’ll need to use the forward slash ‘/’ instead of Windows’ normal backslash ‘\’.

  • Now all graphs will be sent directly to your PDF file instead of the computer screen. Before you can view the file, you must close it properly, otherwise the PDF file will be corrupted.

  • The dev.off() command will close the file properly!
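The same workflow can be scripted end to end; a sketch that writes to a temporary directory instead of the Desktop (the file name is just an example):

```r
out_file <- file.path(tempdir(), "my_first_graph.pdf")

pdf(out_file)                  # open the PDF device
plot(c(1, 2, 3), c(4, 5, 6))   # this graph goes into the file, not the screen
dev.off()                      # close the device so the file is valid

file.exists(out_file)          # TRUE
```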

13.7 More reading about graphs

  • Essential readings:

  • Sections 4.1 to 4.4 of Emmanuel Paradis, R for Beginners. Here you’ll learn about:

    • How to print graphs into a PDF file,

    • How to lay out graphs so you can put multiple graphs onto the same page or frame,

    • A list of useful plotting commands,

    • Low-level plotting commands,

    • Setting other graphical parameters.

  • ‘Graphics: An introduction’, pages 61-96 of Petra Kuhnert and Bill Venables (2005) An Introduction to R: Software for Statistical Modelling & Computing. Cleveland, Australia. It covers:

· Anatomy of a plot

· Q-Q plot

· Density plot

· Time series plot

· Correlation and covariance function plot

· Adding points, text, symbols and lines

· Displaying higher dimension data

· For some extremely in-depth examples of how to format a graph nicely, see sections 4.5 and 4.6 of Emmanuel Paradis, R for Beginners.

· R graph library:

14 Manipulating data

· Here you’ll learn how to:

o Extract data in a vector or a list that meets certain conditions, e.g. odd or even numbers, values greater than a certain threshold, etc.

o Apply a function to a whole list or matrix.

o Sort values in a matrix.

· Wand, M. (2004) Fundamentals of R. A “Hands-On” Tutorial, Department of Statistics, University of New South Wales and [Last accessed: 22-02-08]

· Essential hands-on: Read Section 7. Data manipulation, of the above material. It is important that you type or copy and paste the commands into the R command prompt to learn how it works, since this one is a hands-on tutorial.

  • Extra reading material for familiarising yourself with data manipulation: read ‘Manipulating Data’, pages 97 to 107 of Petra Kuhnert and Bill Venables (2005) An Introduction to R: Software for Statistical Modelling & Computing. Cleveland, Australia.

15 Missing values, Infinite values, Indefinite values in R

When you are dealing with real-world data, it is usual to come across data tables with missing values. Sometimes you’ll need to work out how many missing values there are, in order to assess the quality and quantity of the data. Infinite or indefinite values can also occur if we carelessly divide entries by zero, which is sometimes due to a lack of data or incorrect data entry. It is important to be aware of these values, and to filter them out before doing any critical calculations. To filter out the missing values, denoted as NA in R, use the function ‘’. For example:

> my_list <- c( 1,2,3,4, NA, 6, NA, 8,9)

> my_list

[1] 1 2 3 4 NA 6 NA 8 9



> my_list[ == FALSE ]

[1] 1 2 3 4 6 8 9

Similar functions are available for detecting values generated from dividing a number by zero, is.nan, and the infinite values is.infinite. For example:

> x <- c( 135, NA, NaN, Inf)

> is.nan(x)

[1] FALSE FALSE  TRUE FALSE

> is.infinite(x)

[1] FALSE FALSE FALSE  TRUE
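Two common follow-ups to the filtering idiom above are counting the NAs and asking a summary function to skip them; for example:

```r
my_list <- c(1, 2, 3, 4, NA, 6, NA, 8, 9)

sum(               # count of missing values: 2
my_list[!]       # drops the NAs, like the == FALSE form
mean(my_list, na.rm = TRUE)    # many functions accept na.rm to skip NAs
```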


16 Lookup tables and assigning a value to a key word

Sometimes you’ll need to index a lot of client code numbers against client names for data analysis. This is when lookup tables become extremely handy. For example, suppose you want to store a list of names of your company’s spies and their code numbers. First, we build our specialised lookup table:

lookup_table <- new.env( hash=T)

assign ( "James Bond", 7, envir=lookup_table)

assign ( "Shaun Connery", 1, envir=lookup_table)

assign ( "Mr. Bean", 99, envir=lookup_table)

  • Now we can retrieve the associated code number for each person in our agency, or see whether the person also ‘exists’ in our agency as a spy.

> get ( "James Bond", envir=lookup_table)

[1] 7

> exists( "Shaun Connery", envir=lookup_table)

[1] TRUE

> ls(lookup_table)

[1] "James Bond" "Mr. Bean" "Shaun Connery"

  • If the spy is not efficient, we can lay the spy off:

> rm ( list=c("Mr. Bean"), envir=lookup_table)

> ls(lookup_table)

[1] "James Bond" "Shaun Connery"

We can also use this as a quick lookup table, to check whether a client exists in our database, or to quickly look up a client’s code. This structure allows very fast lookups, since it is indexed internally (hashed) by the computer. If you are looking up a large number of keys and do not have a full-scale relational database, this data structure is recommended. The keys and values stored in the lookup table are not limited to numbers or text; a value could be an array, a matrix, text, etc. The uses are virtually limitless!
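Two related built-ins are worth knowing here: mget fetches several keys at once, and exists can be told not to search enclosing environments. A short sketch:

```r
lookup_table <- new.env(hash = TRUE)
assign("James Bond", 7, envir = lookup_table)
assign("Shaun Connery", 1, envir = lookup_table)

# mget retrieves several keys in one call
mget(c("James Bond", "Shaun Connery"), envir = lookup_table)

# inherits = FALSE restricts the check to this table only
exists("Mr. Bean", envir = lookup_table, inherits = FALSE)   # FALSE
```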

17 Writing R functions

· Wand, M. (2004) Fundamentals of R. A “Hands-On” Tutorial, Department of Statistics, University of New South Wales and [Last accessed: 22-02-08]

· Essential reading: Read ‘Section 1. Writing functions’, of the above material. It is important that you type or copy and paste the commands into the R command prompt to learn how it works, since this one is a hands-on tutorial.

· Essential reading: the ‘Programming’ section of Emmanuel Paradis, R for Beginners.

· More on control flow:

o The keyword ‘break’ allows the user to break out of the current loop, and the program continues at the first statement outside the innermost loop.

o The keyword ‘next’ halts the processing of the current iteration of the loop, and advances to the next step in the loop’s index.

o Both ‘break’ and ‘next’ apply to the innermost loop in a nested set of loops.

o Instead of just using an ‘if’ statement, you can follow an ‘if’ statement with an ‘else’ statement to control the flow of the program when the condition for the ‘if’ statement does not apply. For example:

if ( temperature < 18 ) {

print ("please turn on the heater")

} else {

print ("please turn off the heater")

}
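Wrapped in a function, the same if/else logic can be called repeatedly; a small sketch (heater_advice is just an example name):

```r
heater_advice <- function(temperature) {
  if (temperature < 18) {
    "please turn on the heater"
  } else {
    "please turn off the heater"
  }
}

heater_advice(15)   # "please turn on the heater"
heater_advice(25)   # "please turn off the heater"
```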


18 Hints and tips on debugging R functions

· R functions can be very hard to debug for a few reasons. Once you have entered a function into R, you will not know whether it contains a bug until the function is executed. The R command interpreter doesn’t give you very user-friendly information on where a syntax error is; most importantly, it does not tell you on which line the error occurred! Therefore, it is up to the programmer to locate the error in the code. Here are a few hints and tips on how to debug R programs and functions:

· Test your code bit by bit. It is much easier to find a problem in a smaller chunk of code than in a large one. It is also a good habit to test a small chunk of code before writing more code which depends on it.

· Insert ‘print’ statements at many places, to print out intermediate values within the function. Save all the output from intermediate values into a text file.

· Use a ‘debug’ global variable to control your debugging ‘print’ statements. You can turn your debugging ‘print’ statements off with a simple switch, e.g.

debug <- 1

x <- 3

x <- (x*x + 1)*11

if ( debug == 1 ) {

print (x)

}

· Develop a comprehensive list of test cases. Think of input which could break the function, and test for it. One common example is inputting an empty list or 0. For example, the following function will not work if you pass in an empty list. Do you know why it breaks down?

My_sd <- function( vector_of_number ) {

My_sd <- sd(vector_of_number)

}

· Use a wait function, which pauses the program until the user presses Enter to continue (code developed by Matt Wand):

wait <- function() {

cat("Hit Enter to continue\n")

ans <- readline()

}

· When dealing with large datasets, use a smaller subset to test your program, since running the function on a large dataset requires a much longer time.

· Keep different versions of the code – Save your codes regularly, and save them into different files recording the date and time when you saved it. This will help you trace back your work if you introduced bugs into the program. It also allows you to try out different ways to solve the problem and see which one is better.
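Following the test-case advice above, here is one defensive rewrite of My_sd (the guard and the name my_sd_safe are illustrative choices, not the book’s code):

```r
my_sd_safe <- function(vector_of_number) {
  if (length(vector_of_number) < 2) {
    return(NA)   # the standard deviation is undefined for fewer than 2 values
  }
  sd(vector_of_number)
}

my_sd_safe(c(2, 4, 6))   # 2
my_sd_safe(c())          # NA, instead of an error or warning
```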

Matching values with regular expressions

Suppose you want to find a particular phrase or pattern in your R data; how would you do it? You’ll use R’s regular expression and pattern matching syntax. This syntax is extremely flexible and allows you to match a wide variety of patterns with many different sets of wildcards.

  • Using the grep function to see if the pattern matches. When there is no match, grep returns the empty result integer(0):

> grep ( "hell", "hello")

[1] 1

> grep ( "sp", "spam")

[1] 1

> grep ( "hello[[:space:]]world", "The computer says hello world")

[1] 1

> grep ( "sam", "spam")

integer(0)
  • Start of sentence, end of sentence

    • The ‘^’ character represents matching the start of a sentence and the ‘$’ character represents matching the end of a sentence. For examples:

> grep ( "^sam", "sam is back")

[1] 1

> grep ( "^sam", "hello sam")

integer(0)

> grep ( "sam$", "hello sam")

[1] 1

> grep ( "sam$", "sam is back")

integer(0)

> grep ( "^sam$", "sam")

[1] 1

> grep ( "^sam$", "hello sam")

integer(0)
  • Using the wildcard character

    • ‘.’ represents the wildcard character (any single character)

    • ‘*’ means the preceding character is repeated zero or more times

    • ‘+’ means the preceding character is repeated one or more times

    • ‘?’ means the preceding character is repeated zero or one time

> grep ( "a.*c", "a c")

[1] 1

> grep ( "a.+c", "a c")

[1] 1

> grep ( "a.+c", "ac")

integer(0)
> grep ( "a.*c", "ac")

[1] 1

> grep ( "a.?c", "ac")

[1] 1

> grep ( "a.?c", "a c")

[1] 1

> grep ( "a.?c", "a  c")

integer(0)
  • Using the number of repeats identifier

    • {n} represents the preceding character is matched exactly n times.

    • {n,} represents the preceding character is matched n or more times.

    • {n,m} represents the preceding character item is matched at least n times, but not more than m times.

> grep ( "ab{2,3}c", "abbc")

[1] 1

> grep ( "ab{2,3}c", "abbbc")

[1] 1

> grep ( "ab{2,3}c", "abbbbc")

integer(0)

> grep ( "ab{2,3}c", "abc")

integer(0)
  • Using the square bracket to match this or that

    • For example, [0123456789] matches any single digit, and [abc] matches the letter a, b or c. Unlike the start-of-sentence character, putting the symbol ‘^’ at the start inside the square brackets, like [^abc], matches anything except the characters a, b or c.

  • For more information on how to use regular expressions read the following entries in the R help manual:

    • ?regexp

    • ?grep

    • ?gregexpr
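Alongside grep, the related functions grepl and sub use the same regular expression syntax; a few quick examples:

```r
grepl("^sam$", "sam")                    # TRUE
grepl("ab{2,3}c", "abbbbc")              # FALSE: b is repeated four times
grepl("[0-9]+", c("abc", "a1b"))         # FALSE TRUE, one result per element
sub("[[:space:]]+", "_", "hello world")  # "hello_world"
```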

19 Statistical testing

19.1 T-test

Now we perform a paired T-test on two groups of 10 patients each. Each group is given a different sleeping pill, and we want to compare the increase in hours of sleep between the two types of drug.

·         plot(extra ~ group, data = sleep, main="Compare two types of sleeping pills")



·         Traditional interface, for performing a paired T-test


t.test(sleep$extra[sleep$group == 1], sleep$extra[sleep$group == 2], paired=TRUE)


        Paired t-test


data:  sleep$extra[sleep$group == 1] and sleep$extra[sleep$group == 2] 

t = -4.0621, df = 9, p-value = 0.002833

alternative hypothesis: true difference in means is not equal to 0 

95 percent confidence interval:

 -2.4598858 -0.7001142 

sample estimates:

mean of the differences 

                  -1.58 

·         Formula interface

t.test(extra ~ group, data = sleep, paired=TRUE)
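The individual numbers can also be pulled out of the returned test object programmatically; a short sketch using the traditional interface:

```r
tt <- t.test(sleep$extra[sleep$group == 1],
             sleep$extra[sleep$group == 2], paired = TRUE)

tt$p.value    # 0.002833, as in the printed output above
tt$estimate   # the mean of the differences: -1.58
```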

19.2 Analysis of variance (ANOVA)

  • We want to see whether there are any significant differences between the effectiveness of the different insect sprays. For this we use the same InsectSprays dataset.

> anova( lm(count ~ spray, data = InsectSprays))

Analysis of Variance Table

Response: count

Df Sum Sq Mean Sq F value Pr(>F)

spray 5 2668.83 533.77 34.702 < 2.2e-16 ***

Residuals 66 1015.17 15.38


Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

  • We know from the ANOVA that at least one insect spray is significantly different in effectiveness from at least one other. We can now perform a posteriori pair-wise statistical tests, together with an appropriate p-value correction such as the Bonferroni correction, to find out which sprays are significantly more effective than the others. From looking at the box-plot, it seems that spray C is the most effective.
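One such a posteriori test built into R is pairwise.t.test; a minimal sketch using the Bonferroni correction:

```r
# Bonferroni-corrected pairwise comparisons between all sprays
pw <- pairwise.t.test(InsectSprays$count, InsectSprays$spray,
                      p.adjust.method = "bonferroni")

pw$p.value                    # matrix of adjusted p-values, one per pair
pw$p.value["C", "A"] < 0.05   # TRUE: sprays A and C differ significantly
```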

· For more information on ANOVA and the InsectSprays example, see ‘Section 5, Statistical analyses with R’, pages 55-61, Emmanuel Paradis, R for Beginners.

20 Linear Models

· Essential reading: Section 5, Linear (Multiple Regression) Models and Analysis of Variance, page 37 to 49, Maindonald, JH (2004) Using R for Data Analysis and Graphics: Introduction, Code and Commentary. Centre for Bioinformation Science, Australian National University.

21 Installing and loading R package

· See Page 61 – 63, Emmanuel Paradis, R for Beginners

22 List of Useful R resources

· See Page 71 – 72, Emmanuel Paradis, R for Beginners

23 References

· J H Maindonald (2004) Using R for Data Analysis and Graphics: Introduction, Code and Commentary. Centre for Bioinformation Science, Australian National University. [Last accessed: 29-02-08]

· Emmanuel Paradis, R for Beginners. [Last accessed: 29-02-08]

· R graph library. [Last accessed: 29-02-08]

· Matt Wand’s Bioinformatics course web page. [Last accessed: 22-02-08]

· Matt Wand (2004) Fundamentals of R. A “Hands-On” Tutorial, Department of Statistics, University of New South Wales. [Last accessed: 22-02-08]

· Wikipedia, article on box plot. [Last accessed: 29-02-08]

Monday, 8 March 2010

Unix Authentication and Validation

There are a variety of ways in which a user can authenticate in UNIX. The primary distinction is between authentication to the operating system and authentication to an application alone. In the case of an application such as a window manager (e.g. X-Window), authenticating to the application is in effect authenticating to the operating system itself. Additionally, authentication may be divided into local and networked authentication. In either case, the same applications may provide access to either the local or a remote system. For instance, X-Window may be used both as a local window manager and as a means of accessing a remote UNIX system. Similarly, network access tools such as SSH provide the capability of connecting to a remote host but may also connect to the local machine, by connecting either to its advertised IP address or to the localhost (127.0.0.1) address.
The UNIX authentication scheme is based on the /etc/passwd file. PAM (Pluggable Authentication Modules) has extended this functionality and allowed for the integration of many other authentication schemes. PAM was first proposed by Sun Microsystems in 1995 and was integrated into Red Hat Linux the following year. Subsequently, PAM has become the mainstay authentication scheme for Linux and many UNIX varieties, and has been standardized as a component of the X/Open UNIX standardization process, resulting in the X/Open Single Sign-on (XSSO) standard. From the auditor’s perspective, PAM does, however, necessitate a recovery mechanism integrated into the operating system in case a difficulty develops in the linker or shared libraries. The auditor also needs to come to an understanding of the complete authentication and authorization methodology deployed on the system. PAM allows for single sign-on across multiple servers. Additionally, there are a large number of plug-ins to PAM that vary in their strength. It is important to assess the overall level of security provided by these and remember that the system is only as secure as the weakest link.
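To illustrate, PAM is configured through per-service files that stack modules for each management group. The fragment below is a hypothetical /etc/pam.d/sshd; actual module names, paths and control flags vary between distributions:

```
# /etc/pam.d/sshd -- illustrative stack only; real configurations vary
auth      required  pam_unix.so      # verify password against /etc/shadow
account   required  pam_nologin.so   # refuse login if /etc/nologin exists
password  required  pam_unix.so      # enforce policy on password changes
session   required  pam_limits.so    # apply resource limits to the session
```

An auditor reviewing PAM should walk each file under /etc/pam.d and account for every module in the stack, since a single weak entry can undermine the rest of the configuration.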
The fallback authentication method for any UNIX system lies with the /etc/passwd (password) file. In modern UNIX systems this will be coupled with a shadow file. The password file contains the username, a password field, the user ID (UID), the group ID (GID), a descriptor field (which generally holds the user’s full name), the user’s home directory and the user’s default shell.
Figure 1: The /etc/passwd file
The user ID and group ID give the system the information needed to match access requirements. The home directory in the password file is the default directory that a user will be sent to in the case of an interactive login. The shell directive sets the initial shell assigned to the user on login. In many cases a user will be able to change directories or initiate an alternative shell, but this at least sets the initial environment. It is important to remember that the password file is generally world readable: in order to correlate user IDs to user names when looking at directory listings and process listings, the system requires that the password file be readable (at least in read-only mode) by all authenticated users.
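The seven colon-separated fields described above can be pulled apart with a few lines of code. This is a sketch using a fabricated sample entry, not a real account:

```python
# Split one /etc/passwd line into its seven named fields.
PASSWD_FIELDS = ("name", "passwd", "uid", "gid", "gecos", "home", "shell")

def parse_passwd_line(line):
    """Return a dict mapping field names to the values in one entry."""
    return dict(zip(PASSWD_FIELDS, line.strip().split(":")))

entry = parse_passwd_line("audit:x:1001:1001:Audit User:/home/audit:/bin/bash")
print(entry["uid"], entry["shell"])   # -> 1001 /bin/bash
```

Walking the whole file this way makes it easy to flag accounts with UID 0, empty password fields, or unexpected shells.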
The password field of the /etc/passwd file has a historical origin. Before the password and shadow files were split, hashes were stored in this file. To maintain compatibility, the same format has been retained. In modern systems where the password and shadow files are split, an “x” is used to indicate that the system has stored the password hashes in an alternative file. If there is a blank space instead of the “x”, the account has no password. It is crucial that the auditor validates the authentication method used.
The default shell may be a standard interactive shell, a custom script or application designed to limit the functionality of the user, or even a false shell designed to restrict use and stop interactive logins. False shells are generally used in the case of service accounts. This allows the account to login (such as in the case of “lp” for print services) and complete the task it is assigned. Additionally, users may be configured to run an application. A custom script could be configured to start the application, allowing the user limited access to the system, and then to log the user out of the system when they exit the application. It is important for the auditor to check that breakpoints cannot be set allowing the user to gain an interactive shell. Further, in the case of application access, it is also important to check that the application does not allow the user to spawn an interactive shell if this is not desired.
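A restricted account of the kind described might be given a wrapper script in place of a shell. The application path below is hypothetical; the key point is the use of exec, which leaves no parent shell to drop back into when the application exits:

```sh
#!/bin/sh
# Hypothetical restricted "shell" listed in /etc/passwd for the account.
# exec replaces this script with the application, so when the application
# exits there is no interactive shell for the user to return to.
exec /usr/local/bin/inventory-app
```

The auditor should still verify that the application itself cannot spawn a shell (for example via an editor or help escape).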
As was mentioned above, the majority of modern UNIX systems deploy a shadow file. This file is associated with the password file but, unlike the password file, should not be accessible (even for reading) by the majority of users on the system. The format of this file is:
User Password_Hash Last_Changed Password Policy
This allows the system to match the user and other information in the shadow file to the password file. The password field is in actuality a password hash. The reason this file should be protected is the very reason it first came into existence. In the early versions of UNIX there was no shadow file. Because the password file was world readable, a common attack was to copy the password file and use a dictionary to “crack” the password hashes. By splitting the password and shadow files, the password hash is no longer available to all users, which makes it more difficult for a user to attack the system. The password hash function always creates the same number of characters (this may vary from system to system based on the algorithm deployed, such as MD5, DES etc.).
UNIX systems are characteristically configured to allow a minimum of zero days between password changes and a maximum of 99,999 days between changes. In effect this means that the password policies are ineffective. The fields that exist in the shadow file are detailed below:
· The username,
· The password Hash,
· The Number of days since 01 Jan 1970 that password was last changed,
· The Number of days that must pass before the password can be changed,
· The Number of days after which password must be changed,
· The Number of days before expiration that user is warned,
· The Number of days after expiration that account is disabled,
· The Number of days since 01 Jan 1970 that account has been disabled.
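These fields can be decoded mechanically. The entry below is fabricated for illustration (the hash is a placeholder), and the date arithmetic shows how a “days since 01 Jan 1970” field translates into a calendar date:

```python
from datetime import date, timedelta

# Field order in a modern /etc/shadow entry
SHADOW_FIELDS = ("name", "hash", "last_change", "min_days", "max_days",
                 "warn_days", "inactive_days", "expire_date", "reserved")

def parse_shadow_line(line):
    """Return a dict of shadow fields; empty fields come back as ''."""
    return dict(zip(SHADOW_FIELDS, line.strip().split(":")))

# Fabricated entry: password last changed 14700 days after 01 Jan 1970
entry = parse_shadow_line("audit:$6$salt$hashvalue:14700:0:99999:7:::")
changed = date(1970, 1, 1) + timedelta(days=int(entry["last_change"]))
print(changed)                                # -> 2010-04-01
print(entry["min_days"], entry["max_days"])   # -> 0 99999 (no real policy)
```

A min of 0 and max of 99999, as here, is exactly the ineffective default policy described above, and is worth flagging in an audit.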
Because the hash function will always create a password hash of the same length, it is possible to restrict logins by changing the password hash variable in the shadow file. For instance, changing the password hash field to a string such as “No_login” will disable password logins. As this string is shorter than a valid password hash, no password could ever hash to a value matching it. So in this instance we have created an account that is not disabled but will not allow interactive logins.
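In practice most administrators do not edit the hash by hand; standard utilities achieve the same effect by prepending an invalid character to the hash field. A sketch (these commands need root, and exact names vary between platforms):

```
# Lock the account: on Linux this prepends '!' to the hash in /etc/shadow,
# so no password can ever hash to a matching value
passwd -l audit

# Verify: the second field of the entry should now begin with '!'
grep '^audit:' /etc/shadow
```

When reviewing a shadow file, entries beginning with “!” or “*” therefore indicate locked or non-password accounts rather than weak hashes.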
Many systems also support complex password policies. This information is generally stored in the “password policy” section of the shadow file. The password policy generally consists of the minimum password age, maximum password age, expiration warning timer, post-expiration disable timer, and a count of how many days an account has been disabled. Most system administrators do not know how to interpret the shadow file. As an auditor, knowledge of this information will be valuable: not only will it allow you to validate password policy information, but it may also help in displaying a level of technical knowledge.
When auditing access rights, it is important to look at both how the user logs in and where they log in from. Always consider the question of whether users should be able to log in to the root account directly. Should they be able to do this across the network? Should they authenticate to the system first and then re-authenticate as root (using a tool such as “su” or “sudo”)? When auditing the system, these are some of the questions that you need to consider.
Many UNIX systems control this type of access using the “/etc/securetty” file. This file includes an inventory of all of the “ttys” used by the system. When auditing the system, it is important to first collate a list of all locations that would be considered secure enough to permit the root user to log in from them. When testing the system, verify that only terminals that are physically connected to the server can log into the system as root. Generally, this means that there is either a serial connection to a secure management server or, more likely, it means allowing root logins only from the console itself. It is also important to note that many services such as SSH have their own configuration files which allow or restrict authentication by the root user. It is important to check not only the “/etc/securetty” file but also any other related configuration files associated with individual applications.
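As an illustration, a restrictive configuration might look like the following; the securetty entries are examples only, while PermitRootLogin is the standard OpenSSH directive:

```
# /etc/securetty -- root may log in only on these terminals
console
tty1

# /etc/ssh/sshd_config -- SSH consults its own setting, not securetty
PermitRootLogin no
```

Checking both files together closes the common gap where console logins are restricted but root can still authenticate over the network.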
Side note: TTY stands for teletype. Back in the early days of UNIX, one of the standard ways of accessing a terminal was via the teletype service. Although this is one of the many technologies that have faded into obscurity, UNIX was first created in the 1960s and 70s. Many of the terms have come down from those long-distant days.