This example demonstrates how to use Hadoop to count DNA bases.
Notice: the method used in this example is not efficient. The example aims to exercise as many Hadoop features as possible. Specifically, it shows a custom InputFormat and RecordReader, a custom partitioner, a sort comparator, and a grouping comparator.
The input file for this test is a FASTA file:
>a
cGTAaccaataaaaaaacaagcttaacctaattc
>a
cggcGGGcagatcta
>b
agcttagTTTGGatctggccgggg
>c
gcggatttactcCCCCCAAAAANNaggggagagcccagataaatggagtctgtgcgtccaca
gaattcgcacca
>c
gcggatttactcaggggagagcccagGGataaatggagtctgtgcgtccaca
gaattcgcacca
>d
tccgtgaaacaaagcggatgtaccggatNNttttattccggctatggggcaa
ttccccgtcgcggagcca
>d
atttatactcatgaaaatcttattcgagttNcattcaagacaagcttgaca
ttgatctacagaccaacagtacttacaaagaATGCCGaaatttaaaatgtggtcac
The file contains the characters [ATGCatgcN], and we want to count the occurrences of each character, ignoring case. The example uses a customized FastaInputFormat to split the file and a FastaRecordReader to turn each split into records. The mapper emits every single character of each FASTA record with a count of 1, e.g. <a,1>, <T,1>, <G,1>, <g,1>. We use 3 reducers, and the partitioner class routes the mapper output so that [ATat] goes to reducer 0, [GCgc] to reducer 1, and everything else to reducer 2. A custom sort comparator orders the key-value pairs before they are passed to the reducers; it compares keys case-insensitively, so the same DNA base in different cases ends up adjacent. Note the asymmetry between the two comparator settings: if you call setSortComparatorClass and leave the grouping comparator unset, grouping falls back to the sort comparator, but the reverse is not true. You can use this example to test this behavior.
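As a concrete illustration of the flow (a hedged sketch with a hypothetical record, not output from an actual run): for a record like ">x" followed by the sequence line "cGTA", the mapper emits <c,1>, <G,1>, <T,1>, <A,1>; the partitioner sends <T,1> and <A,1> to reducer 0 and <c,1> and <G,1> to reducer 1; and within each reducer's input, the case-insensitive sort comparator places c next to C and g next to G, so a case-insensitive grouping comparator can then merge them into a single reduce call.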
Pasted below is the code for each file:
FastaInputFormat.java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.util.LineReader;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
/**
 * An input format, adapted from Hadoop's NLineInputFormat, that splits
 * the input so that each split contains N complete FASTA records.
 *
 * NLineInputFormat is intended for "pleasantly" parallel applications
 * in which each process/mapper processes the same input file(s), but
 * with computations controlled by different parameters ("parameter
 * sweeps"): one parameter set per line of a control file, which is the
 * input path of the map-reduce application, while the input dataset is
 * specified via a config variable in JobConf.
 *
 * Here the same mechanism is reused, but the unit is a whole FASTA
 * record rather than a line: each split is fed to one map task, with
 * the byte offset as the key, i.e. (k,v) is (LongWritable, Text).
 */
@InterfaceAudience.Public
@InterfaceStability.Stable
public class FastaInputFormat extends FileInputFormat<LongWritable, Text> {
  public static final String LINES_PER_MAP =
    "mapreduce.input.lineinputformat.linespermap";
public RecordReader<LongWritable, Text> createRecordReader(
InputSplit genericSplit, TaskAttemptContext context)
throws IOException {
context.setStatus(genericSplit.toString());
return new FastaRecordReader();
}
/**
* Logically splits the set of input files for the job, splits N lines
* of the input as one split.
*
* @see FileInputFormat#getSplits(JobContext)
*/
public List<InputSplit> getSplits(JobContext job)
throws IOException {
List<InputSplit> splits = new ArrayList<InputSplit>();
int numLinesPerSplit = getNumLinesPerSplit(job);
for (FileStatus status : listStatus(job)) {
splits.addAll(getSplitsForFile(status,
job.getConfiguration(), numLinesPerSplit));
}
return splits;
}
public static List<FileSplit> getSplitsForFile(FileStatus status,
Configuration conf, int numLinesPerSplit) throws IOException {
List<FileSplit> splits = new ArrayList<FileSplit> ();
Path fileName = status.getPath();
    if (status.isDir()) {
      throw new IOException("Not a file: " + fileName);
    }
    FileSystem fs = fileName.getFileSystem(conf);
    LineReader lr = null;
    try {
      FSDataInputStream in = fs.open(fileName);
      lr = new LineReader(in, conf);
      Text line = new Text();
      long begin = 0;          // start offset of the current split
      long length = 0;         // total bytes read so far
      long record_length = 0;  // bytes accumulated in the current split
      int recordsRead = 0;     // FASTA headers seen in the current split
      int num = -1;
      while ((num = lr.readLine(line)) > 0) {
        if (line.toString().indexOf(">") >= 0) {
          recordsRead++;
        }
        // The header that begins record numLinesPerSplit + 1 closes the
        // current split; that header line starts the next split.
        if (recordsRead > numLinesPerSplit) {
          splits.add(new FileSplit(fileName, begin, record_length, new String[]{}));
          begin = length;
          record_length = 0;
          recordsRead = 1;
        }
        length += num;
        record_length += num;
      }
      splits.add(new FileSplit(fileName, begin, record_length, new String[]{}));
    } finally {
      if (lr != null) {
        lr.close();
      }
    }
    return splits;
  }
  /**
   * Set the number of FASTA records per split
   * @param job the job to modify
   * @param numLines the number of FASTA records per split
   */
public static void setNumLinesPerSplit(Job job, int numLines) {
job.getConfiguration().setInt(LINES_PER_MAP, numLines);
}
  /**
   * Get the number of FASTA records per split
   * @param job the job
   * @return the number of FASTA records per split
   */
public static int getNumLinesPerSplit(JobContext job) {
return job.getConfiguration().getInt(LINES_PER_MAP, 1);
}
}
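Note that getNumLinesPerSplit defaults to 1, so each split holds a single FASTA record unless the driver asks for more. A minimal sketch of how a driver could request larger splits (the driver below does not make this call):

FastaInputFormat.setNumLinesPerSplit(job, 10); // pack 10 FASTA records into each split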
FastaRecordReader.java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.util.LineReader;
import org.apache.commons.logging.LogFactory;
import org.apache.commons.logging.Log;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
/**
 * Treats keys as byte offsets in the file and values as blocks of
 * complete FASTA records.
 */
public class FastaRecordReader extends RecordReader<LongWritable, Text> {
private static final Log LOG = LogFactory.getLog(FastaRecordReader.class);
private CompressionCodecFactory compressionCodecs = null;
private long start;
private long pos;
private long end;
private LineReader in;
private int maxLineLength;
private LongWritable key = null;
private Text value = null;
FSDataInputStream fileIn;
Configuration job;
public void initialize(InputSplit genericSplit,
TaskAttemptContext context) throws IOException {
FileSplit split = (FileSplit) genericSplit;
job = context.getConfiguration();
    this.maxLineLength = job.getInt("mapred.linerecordreader.maxlength",
                                    Integer.MAX_VALUE);
start = split.getStart();
end = start + split.getLength();
final Path file = split.getPath();
compressionCodecs = new CompressionCodecFactory(job);
final CompressionCodec codec = compressionCodecs.getCodec(file);
// open the file and seek to the start of the split
FileSystem fs = file.getFileSystem(job);
fileIn = fs.open(split.getPath());
boolean skipFirstLine = false;
if (codec != null) {
in = new LineReader(codec.createInputStream(fileIn), job);
end = Long.MAX_VALUE;
} else {
if (start != 0) {
skipFirstLine = true;
        --start;
fileIn.seek(start);
}
in = new LineReader(fileIn, job);
}
    if (skipFirstLine) { // skip first line and re-establish "start".
      start += in.readLine(new Text(), 0,
                           (int)Math.min((long)Integer.MAX_VALUE, end - start));
}
this.pos = start;
}
public boolean nextKeyValue() throws IOException {
if (key == null) {
key = new LongWritable();
}
key.set(pos);
if (value == null) {
value = new Text();
}
    int newSize = 0;
    StringBuilder text = new StringBuilder();
    int record_length = 0;
    Text line = new Text();
    int recordsRead = 0;
    while (pos < end) {
      key.set(pos);
      newSize = in.readLine(line, maxLineLength,
          Math.max((int)Math.min(Integer.MAX_VALUE, end - pos), maxLineLength));
      if (line.toString().indexOf(">") >= 0) {
        if (recordsRead > 9) { // emit 10 FASTA records per key-value pair
          value.set(text.toString());
          // Rewind to the start of this header line so the next call to
          // nextKeyValue() re-reads it as the start of the next block.
          fileIn.seek(pos);
          in = new LineReader(fileIn, job);
          return true;
        }
        recordsRead++;
      }
      record_length += newSize;
      text.append(line.toString());
      text.append("\n");
      pos += newSize;
      if (newSize == 0) {
        break;
      }
    }
    if (record_length == 0) {
      return false;
    }
    value.set(text.toString());
    return true;
}
@Override
public LongWritable getCurrentKey() {
return key;
}
@Override
public Text getCurrentValue() {
return value;
}
/**
* Get the progress within the split
*/
public float getProgress() {
if (start == end) {
return 0.0f;
} else {
return Math.min(1.0f, (pos – start) / (float)(end – start));
}
}
public synchronized void close() throws IOException {
if (in != null) {
in.close();
}
}
}
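With the default of one record per split, each call to nextKeyValue() hands the mapper a single FASTA record, header included and lines rejoined with newlines. For the sample file, the value of the first split would be:
>a
cGTAaccaataaaaaaacaagcttaacctaattc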
CountBaseMapper.java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.commons.logging.LogFactory;
import org.apache.commons.logging.Log;
public class CountBaseMapper
extends Mapper<Object, Text, Text, IntWritable>
{
private Text base = new Text();
private IntWritable one = new IntWritable(1);
private static final Log LOG = LogFactory.getLog(CountBaseMapper.class);
public void map(Object key,
Text value,
Context context)
throws IOException, InterruptedException
{
    // Split the record block into lines; skip FASTA header lines and
    // emit every base character with a count of 1.
    String fasta = value.toString();
    String[] lines = fasta.split("[\\r\\n]+");
    for (int j = 0; j < lines.length; j++) {
      if (lines[j].startsWith(">")) {
        continue; // header line, not sequence data
      }
      char[] array = lines[j].toCharArray();
      for (char c : array) {
        base.set(String.valueOf(c));
        context.write(base, one);
      }
    }
}
}
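For the first record of the sample file, the mapper skips the >a header line and emits one pair per sequence character: <c,1>, <G,1>, <T,1>, <A,1>, <a,1>, <c,1>, <c,1>, and so on.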
CountBaseReducer.java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.commons.logging.LogFactory;
import org.apache.commons.logging.Log;
public class CountBaseReducer
  extends Reducer<Text, IntWritable, Text, IntWritable> // the types must correspond to the output types of map and of reduce
{
//private IntWritable occurrencesOfWord = new IntWritable();
private static final Log LOG = LogFactory.getLog(CountBaseReducer.class);
public void reduce(Text key,
Iterable<IntWritable> values,
Context context)
throws IOException, InterruptedException
{
    LOG.info("--------------reducer----------------");
    int count = 0;
    IntWritable out = new IntWritable();
    for (IntWritable val : values) {
      count += val.get(); // sum the counts (also correct if a combiner is enabled)
    }
    out.set(count);
    LOG.info("<" + key.toString() + ">");
//key.set(“>”+key.toString()+”<“);
context.write(key,out);
}
}
BasePartitioner.java
import org.apache.hadoop.mapreduce.Partitioner;
/** Partition keys by bases{A,T,G,C,a,t,g,c}. */
public class BasePartitioner<K, V> extends Partitioner<K, V> {
public int getPartition(K key, V value,
int numReduceTasks) {
String base = key.toString();
if(base.compareToIgnoreCase(“A”) == 0){
return 0;
}else if(base.compareToIgnoreCase(“C”) == 0){
return 1;
}else if(base.compareToIgnoreCase(“G”) == 0){
return 1;
}else if(base.compareToIgnoreCase(“T”) == 0){
return 0;
}else{
return 2;
}
}
}
BaseComparator.java
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
public class BaseComparator extends WritableComparator {
protected BaseComparator(){
super(Text.class,true);
}
@Override
public int compare(WritableComparable w1, WritableComparable w2) {
Text t1 = (Text) w1;
Text t2 = (Text) w2;
//compare bases ignoring cases
String s1 = t1.toString().toUpperCase();
String s2 = t2.toString().toUpperCase();
int cmp = s1.compareTo(s2);
return cmp;
}
}
CountBaseDriver.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class CountBaseDriver {
  /**
   * The "driver" class: it sets everything up, then gets the job started.
   */
public static void main(String[] args)
throws Exception
{
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
if (otherArgs.length != 2)
{
      System.err.println("Usage: CountBaseDriver <inputFile> <outputDir>");
System.exit(2);
}
    Job job = new Job(conf, "count bases");
job.setJarByClass(CountBaseMapper.class);
job.setMapperClass(CountBaseMapper.class);
job.setNumReduceTasks(3);
//job.setCombinerClass(CountBaseReducer.class);
job.setReducerClass(CountBaseReducer.class);
job.setInputFormatClass(FastaInputFormat.class);
job.setPartitionerClass(BasePartitioner.class);
    job.setSortComparatorClass(BaseComparator.class); // if the grouping comparator is left unset, it defaults to the sort comparator, so we do not need to call setGroupingComparatorClass explicitly; the reverse is not true. You can change these settings to test it.
//job.setGroupingComparatorClass(BaseComparator.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
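To observe the asymmetry noted in the driver comment, you could invert the configuration, for example:

//job.setSortComparatorClass(BaseComparator.class);
job.setGroupingComparatorClass(BaseComparator.class);

With only the grouping comparator set, the shuffle sorts keys case-sensitively, so the upper- and lower-case forms of a base are not adjacent in the sorted stream and cannot be merged into a single reduce call.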
Compile the source code files (make sure all the required jars are on the classpath):
javac -classpath .:/home/ubuntu/hadoop-1.0.1/hadoop-core-1.0.1.jar:/home/ubuntu/hadoop-1.0.1/commons-logging-1.1.1/commons-logging-1.1.1.jar:/home/ubuntu/hadoop-1.0.1/lib/commons-cli-1.2.jar:/home/ubuntu/hadoop-1.0.1/contrib/streaming/hadoop-streaming-1.0.1.jar *.java
Create the jar:
jar cef CountBaseDriver countbasedriver.jar .
Put your input FASTA files into the input directory and run Hadoop:
bin/hadoop fs -mkdir input
bin/hadoop fs -put input/test_fasta.fasta input
bin/hadoop jar countbasedriver.jar input output
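Each reducer writes one part file; to inspect the three outputs (standard Hadoop part-file naming):
bin/hadoop fs -cat output/part-r-00000
bin/hadoop fs -cat output/part-r-00001
bin/hadoop fs -cat output/part-r-00002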
Results:
[screenshot] output without the customized SortComparatorClass
[screenshot] output with the customized SortComparatorClass
Download the source code jar file