Des Profundis...: May 2010

Thursday, 20 May 2010

Quo Vadis with Software Patent

After I read this news, I had to sit down: a German high court ruled in favor of software patents. Unfortunately, this is not an April Fools' joke. The strange thing, however, is that a patent does seem to have been granted for the technology in question, even though it should be obvious that the technology described in the patent already existed at the time of the patent submission.

Privacy on Facebook

Thanks to a friend, I heard about the site http://www.reclaimprivacy.org. It contains some useful information on privacy settings on Facebook. For instance, it links to an interesting page on privacy on Facebook (hosted by the New York Times).

In order to check your own privacy on Facebook, you can follow the instructions on the website:

This website provides an independent and open tool for scanning your Facebook privacy settings. The source code and its development will always remain open and transparent.

Note: we are still working on privacy scans for your photos and status updates. The tool does not check these yet, so stay tuned for updates!

1. Drag this link to your web browser bookmarks bar: Scan for Privacy

2. Go to your Facebook privacy settings and then click that bookmark once you are on Facebook.

3. You will see a series of privacy scans that inspect your privacy settings and warn you about settings that might be unexpectedly public.

4. Follow us on Facebook to hear about the latest updates.

Monday, 10 May 2010

New Advances in Neural Networks

There is a great Google talk about recent advances in pattern recognition with neural networks. It is given by Geoff Hinton. The title of the talk is Recent Developments in Deep Learning.

Thursday, 6 May 2010

AXIOM - an Apache StAX-Based XML Object Model

I will have to take a look at Axiom, which provides a StAX-based object model for accessing XML infosets. It was developed for Axis 2, but it can be used independently.
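To give an idea of what StAX-style pull parsing looks like, here is a minimal sketch. It uses the StAX API that ships with the JDK (javax.xml.stream), not Axiom itself, and the class name StaxPullSketch, the method textOf and the sample XML are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Illustrative only: plain JDK StAX, not the Axiom API.
final class StaxPullSketch {

    /** Returns the text content of every element whose local name is elementName. */
    static List<String> textOf(String xml, String elementName) {
        List<String> result = new ArrayList<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    // The application pulls the next event instead of receiving callbacks.
                    int event = reader.next();
                    if (event == XMLStreamConstants.START_ELEMENT
                            && elementName.equals(reader.getLocalName())) {
                        // Reads the text up to the matching end tag (text-only elements).
                        result.add(reader.getElementText());
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<posts><title>OSGi</title><title>UIMA</title></posts>";
        System.out.println(textOf(xml, "title")); // prints [OSGi, UIMA]
    }
}
```

The point of the pull model is that the loop decides when to advance the parser, which makes it easy to stop early or to skip parts of a large document.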

Apache Tika - Content and Metadata Extraction in Java

Apache Tika is a useful tool to extract text and metadata from a number of formats.

For example, suppose you have a document (PDF, DOC, ...) on the web from which you wish to extract some part. You can use Tika for this:

curl http:urltodoc/.../document.pdf | java -jar tika-app/target/tika-app-0.7.jar --text

This produces the text of the document. Other options exist to return HTML, XHTML, or only the metadata of the document.

Maven Integration

As with other Maven projects, you can specify the dependency in the pom. Note, however, that depending on your needs, you might want to specify one of these (mostly quoted from this page):

tika-core/target/tika-core-0.7.jar: Tika core library. Contains the core interfaces and classes of Tika, but none of the parser implementations. Depends only on Java 5.
tika-parsers/target/tika-parsers-0.7.jar: Tika parsers. Collection of classes that implement the Tika Parser interface based on various external parser libraries.
tika-app/target/tika-app-0.7.jar: Tika application. Combines the above libraries and all the external parser libraries into a single runnable jar with a GUI and a command line interface.
tika-bundle/target/tika-bundle-0.7.jar: Tika bundle. An OSGi bundle that includes everything you need to use all Tika functionality in an OSGi environment.

<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-core</artifactId>
  <version>0.7</version>
</dependency>

If you want to use Tika to parse documents (instead of simply detecting document types, etc.), you'll want to depend on tika-parsers instead:

<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>0.7</version>
</dependency>

Wednesday, 5 May 2010

IBM FileNet P8 Platform

Disclaimer: This entry is not complete and will probably be finished later, since some information needs to be checked. The goal is to have a short summary of the documentation I found.

The IBM FileNet P8 Platform is a platform providing enterprise content management. Documentation can be found on the IBM website at this place. A FileNet P8 System Overview can be found there. Most of the information in this entry comes from this document.

content management
business objects
Lifecycles
Properties
Events and subscription
classification
entry templates
publishing
content storage
content caching
import and export
search
versioning
process management
external application integration
form management
records management (email management)
system management
application development and deployment
scalability
high availability
disaster recovery
accessibility
security
internationalization (i18n)

FileNet uses XML and the Java 2 Platform, Enterprise Edition (J2EE), and relies mainly on the following protocols: Lightweight Directory Access Protocol (LDAP), Hypertext Transfer Protocol (HTTP), and SOAP.

Content Management

Business Objects

Lifecycles

Properties

Events and subscription

The platform provides an event framework that pushes system events to the subscribers registered for them.

Classification

The FileNet platform provides the infrastructure for different kinds of classification of the resources. This classification can be performed either manually or automatically using specific tools.

Entry Templates

Entry templates provide a means of creating objects in a more uniform manner.

Publishing

The platform also provides a means of publishing the stored content.

Content Storage

Content Caching

Import and Export

Search

Versioning

The FileNet application provides tools for versioning the resources stored in the system.

Process Management

External Application Integration

The platform can be integrated with existing applications such as Microsoft Office, SAP R/3 and SharePoint.

Form Management

The documentation of FileNet says that the application provides powerful form creation and management tools.

Records Management (Email Management)

System management

Application Development and Deployment

Scalability

One interesting aspect of the FileNet P8 Platform is that it seems to have been designed with scalability issues and techniques in mind. For its various components, it provides horizontal scalability solutions (like computer farms) as well as vertical ones (i.e., multiple instances of an application can be run in parallel).

Accessibility

Accessibility is an issue for enterprise software, in order to make sure that everyone can use the software. The software is tested for Section 508 compliance, based on the Electronic and Information Technology Accessibility Standards published by the U.S. Access Board on December 21, 2000, at 36 CFR Part 1194. This includes, for example, keyboard traversal and access.

Architecture

The preceding picture, taken from the document cited earlier, shows the architecture of FileNet. In addition to this overall picture, it should be noted that FileNet provides both a Java and a .NET API, although the Java API seems to be the one providing the most functionality.

Content Engine

The content engine is the component that takes care of managing the content. It provides all the necessary functionality, for example secure access, caching, indexing (including full text), search, classification, versioning and life cycles.

Access is provided either through the Java or .NET API or through the Content Engine Web Services. When using Java, one particular option is to use EJB as the means of transport.

Process Engine

The process engine provides a number of components:

Process Analyzer (which is an OLAP component)
Process Simulator to test scenarios
Business Process Framework

Application Engine

Workspace XT - The Graphical Interface

Rendition Engine

Rendition Engine can be used to convert documents to various formats, for example the usual Office formats (Word, Excel, PowerPoint) as well as to PDF or HTML. Multiple Rendition Engines can also be used in order to scale the document conversion process.

Administrative Components

The platform provides a number of administrative components: the dashboard, the system usage reporter and the system monitor.

Tuesday, 4 May 2010

UIMA - Unstructured Information Management Architecture

Disclaimer: This entry is not complete and will be finished later, since some information needs to be checked.

UIMA (Unstructured Information Management Architecture) is a project which was first created by IBM, but which is now one of the top-level projects of the Apache Software Foundation. It provides an architecture for annotating unstructured information with the help of a set of annotators and analysis engines which can be combined and aggregated.

In the following sections, I will introduce the main elements needed to understand the UIMA infrastructure.

CAS - Common Analysis Structure

The main structure in the UIMA architecture is the CAS (Common Analysis Structure). Note that I had some difficulty finding out what the acronym means, but I finally found it in the glossary, which should be read first, because even the overview gave no explanation as to what a CAS is.

A CAS is the structure manipulated by the annotators and analysis engines.

Analysis Engines

The UIMA architecture introduces the idea of analysis engines, which take a CAS view (i.e., some annotation structure representing a view of the data) and return a .

Annotators

The glossary of the UIMA documentation defines annotators as:

A software component that implements the UIMA annotator interface. Annotators are implemented to produce and record annotations over regions of an artifact (e.g., text document, audio, and video).

They represent the starting point for the analysis engine.
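The idea of an annotator can be sketched in plain Java. This is not the UIMA API: Span, AnnotatorSketch and annotateNumbers are hypothetical names chosen for illustration, and a real UIMA annotator would record its annotations in the CAS instead of returning a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a plain-Java stand-in for the annotator concept, not the UIMA API.
final class AnnotatorSketch {

    /** Hypothetical annotation: a typed region [begin, end) of the analysed text. */
    static final class Span {
        final String type;
        final int begin;
        final int end;

        Span(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }

        /** Returns the part of the document covered by this span. */
        String coveredText(String document) {
            return document.substring(begin, end);
        }
    }

    private static final Pattern DIGITS = Pattern.compile("\\d+");

    /** Hypothetical annotator: marks every run of digits as a "Number" span. */
    static List<Span> annotateNumbers(String document) {
        List<Span> spans = new ArrayList<>();
        Matcher matcher = DIGITS.matcher(document);
        while (matcher.find()) {
            spans.add(new Span("Number", matcher.start(), matcher.end()));
        }
        return spans;
    }

    public static void main(String[] args) {
        String document = "Tika 0.7 was announced in May 2010";
        for (Span span : annotateNumbers(document)) {
            System.out.println(span.type + " [" + span.begin + ", " + span.end + ") "
                    + span.coveredText(document));
        }
    }
}
```

The annotations only store offsets into the text, so several annotators can describe the same document without modifying it.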

Indexing

One of the main points of interest of the UIMA architecture is that it provides a standard interface for defining the indexing of CASes and their views. However, I still need to clear things up here.

PEAR

A PEAR is an archive file packaging the code, descriptor files and other resources required to install and run a UIMA component in another environment. The UIMA SDK provides tools to create such PEAR files. Note that the PEAR acronym is not defined in the documentation either.

New Top Level Apache Projects

In an announcement sent by email, the Apache Software Foundation announced a number of new top-level projects.

Apache Traffic Server is a richly-featured, fast, scalable, and extensible HTTP/1.1 compliant caching proxy server.

Apache Mahout provides scalable implementations of machine learning algorithms on top of Apache Hadoop and other technologies.

Apache Tika is an embeddable, lightweight toolkit for content detection, and analysis.

Apache Nutch is a highly-modular, Web searching engine based on Lucene Java with added Web-specifics, such as a crawler, a link-graph database, and parsers for HTML and other document formats.

Apache Avro is a fast data serialization system that includes rich and dynamic schemas in all its processing.

Apache HBase is a distributed database modeled after Google's Bigtable.

Apache UIMA (Unstructured Information Management Architecture) is a framework for analyzing unstructured information, such as natural language text.

Apache Cassandra is an advanced, second-generation “NoSQL” distributed data store that has a shared-nothing architecture.

Apache Subversion is a source code management system very often used in enterprise and open source projects.

Apache Click is a modern Java EE Web application framework that provides a natural, rich client style programming model.

Apache Shindig is an OpenSocial container and helps you to start hosting OpenSocial apps quickly by providing the code to render gadgets, proxy requests, and handle REST and RPC requests.

I believe I am becoming somewhat of an Apache fanboy ;-).

Monday, 3 May 2010

OSGi

OSGi (the name used to stand for "Open Services Gateway initiative") is a standard defining a software platform infrastructure for Java. The goal is to provide the infrastructure for deploying modularised applications and services with a component model (the components are called bundles or services). The components can be managed using a service registry. They can be loaded, started and stopped.

The OSGi standard uses metadata found in the manifest of each jar file in order to load the bundles. In particular, the manifest specifies the packages exported and imported by the bundle. In this way, a bundle can use classes that would otherwise conflict with those of other bundles and hide them by not exporting them. Other bundles may use the packages that a bundle exports.

Bundles

Bundles are jar files with the corresponding entries in the manifest. The following example shows the manifest of a bundle requiring the bundle org.eclipse.ui.

Manifest-Version: 1.0
Bundle-ManifestVersion: 2
Bundle-Name: My Yellow World Example
Bundle-SymbolicName: de.desprofundis.example; singleton:=true
Bundle-Version: 1.0.0
Bundle-Activator: de.desprofundis.example.Activator
Require-Bundle: org.eclipse.ui,
 org.eclipse.core.runtime
Bundle-ActivationPolicy: lazy
Bundle-RequiredExecutionEnvironment: JavaSE-1.6

Note the "Require-Bundle" entry, which lists the bundles required by this bundle. The OSGi container is responsible for checking whether the dependencies are satisfied. Another important piece of information here is the Bundle-Activator, which is the class in charge of activating the bundle (as well as shutting it down when needed).
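Since a bundle manifest is an ordinary jar manifest, the Require-Bundle header can be read with the JDK's java.util.jar.Manifest class. The following is only a minimal sketch, not how an OSGi container actually resolves dependencies: the class and method names are hypothetical, and the naive comma split assumes that no entry carries a quoted version range containing a comma.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.Manifest;

// Illustrative only: reads a manifest with the JDK, without any OSGi framework.
final class RequireBundleSketch {

    /** Returns the symbolic names listed in the Require-Bundle header, if any. */
    static List<String> requiredBundles(String manifestText) {
        List<String> bundles = new ArrayList<>();
        try {
            Manifest manifest = new Manifest(
                    new ByteArrayInputStream(manifestText.getBytes(StandardCharsets.UTF_8)));
            String header = manifest.getMainAttributes().getValue("Require-Bundle");
            if (header == null) {
                return bundles;
            }
            // Simplification: assumes no commas inside quoted attribute values.
            for (String clause : header.split(",")) {
                // Drop attributes and directives such as ;bundle-version="3.5.0".
                String name = clause.split(";")[0].trim();
                if (!name.isEmpty()) {
                    bundles.add(name);
                }
            }
        } catch (IOException e) {
            throw new IllegalArgumentException("Not a valid manifest", e);
        }
        return bundles;
    }

    public static void main(String[] args) {
        // Continuation lines in a manifest start with a single space.
        String manifestText = String.join("\n",
                "Manifest-Version: 1.0",
                "Bundle-SymbolicName: de.desprofundis.example; singleton:=true",
                "Require-Bundle: org.eclipse.ui,",
                " org.eclipse.core.runtime",
                "");
        System.out.println(requiredBundles(manifestText));
        // prints [org.eclipse.ui, org.eclipse.core.runtime]
    }
}
```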

Services

In addition to dependency management and version hiding, the OSGi framework also provides a registry for services. Moreover, services can be injected into other bundles. A tutorial by Lars Vogel presents their use succinctly.

Zen Coding

Zen Coding is a utility to generate simple HTML skeleton structures using a path-like expression with some syntactic enhancements. Judging from the demos, it is really quite impressive, because it lets one generate complex HTML structures quite quickly.

I read the entry on slashdot.org and then followed the link to this blog.

Though I still need to try it, I am amazed. I have to make it work in Eclipse and Emacs (as well as perhaps vi).