SSL for Astyanax

Security is important for any application that handles personal data, and one of the most common ways of protecting the wire is SSL. Because I struggled to set up client-server SSL, I thought I'd share what I learned as a brief tutorial on the steps needed to implement client-server security between Cassandra and the Astyanax client.

First, to clarify the terms used:

Astyanax - A high-level, Thrift-based Java API for Cassandra.
Java Keystore (JKS) - A file that stores private keys, and the certificates with their corresponding public keys.
Java Truststore (JTS) - Another file that contains certificates from other parties that you expect to communicate with, or from Certificate Authorities that you trust to identify other parties.
Keytool - A key and certificate management utility allowing for creation and signing of certificate stores.
Secure Sockets Layer (SSL) - A cryptographic protocol that provides communication security over the Internet.

Creating the Necessary Certificates

The simplest guide I found was Acunu's guide to Cassandra security. The section headed 'Cassandra client certificates' explains in detail the various keytool commands used to generate the necessary certificate stores, and is the source of the snippet below outlining the process:

Configuring Cassandra to use SSL

The configuration takes place in the cassandra.yaml file located in casandradir/conf/. The option we are after is client_encryption_options (typically located near the bottom of the file), which should be modified to match the snippet below:


Enabling SSL in the Astyanax client

To allow Astyanax to communicate with Cassandra, the path to the certificate store and its password have to be supplied when creating the Astyanax context.
The file containing the certificate store, based on the above certificate generation example, is "cassandra_external_trust.jks".
When creating the Astyanax context, inside the withConnectionPoolConfiguration call, we call setSSLConnectionContext to enable SSL and pass it an SSLConnectionContext object. The object requires two parameters, the path to the certificate store and the store’s password.
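
Before handing the path and password to Astyanax, it can save some debugging to confirm that the truststore actually opens with them. The sketch below is not part of Astyanax and is not the original snippet from this post; it only uses the JDK's KeyStore class. The class name TruststoreCheck, the conf/ location and the password are placeholders made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.GeneralSecurityException;
import java.security.KeyStore;

final class TruststoreCheck {

    /**
     * Opens the JKS file with the given password and returns how many
     * entries it holds. Throws if the file is missing or the password
     * does not match.
     */
    static int countEntries(Path storePath, char[] password)
            throws IOException, GeneralSecurityException {
        KeyStore store = KeyStore.getInstance("JKS");
        try (InputStream in = Files.newInputStream(storePath)) {
            store.load(in, password);
        }
        return store.size();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical location and password; substitute your own.
        Path path = Paths.get("conf", "cassandra_external_trust.jks");
        int entries = countEntries(path, "changeit".toCharArray());
        System.out.println(path + " holds " + entries + " entries");
    }
}
```

If this throws, the same path and password will not work inside the SSLConnectionContext either, so it is worth fixing here first.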




And now our communication channel is protected by SSL encryption. The two Wireshark screenshots below will hopefully demonstrate why it's a good idea to protect your communication channel.

Regular Thrift traffic

Unencrypted Thrift traffic with human-readable sensitive data

Encrypted Thrift traffic

Encrypted Thrift traffic with scrambled data that is unreadable by anyone who doesn't hold the private key.

Setting up C* - OSX vs Windows


This post is a quick overview of my experience of trying to build a platform-independent application that uses Cassandra (I say platform-independent, but what I mean is that Ubuntu, OSX and Win7/8 are supported).

Issues

  • Permissions - Unix-based systems enforce permissions tightly and prevent use of the /var/log and /var/lib directories without root access.
  • System configurations - Windows requires the CASSANDRA_HOME environment variable to be configured.
  • Dependencies - Cassandra requires a JDK, so the fact that OSX ships with the developer tools is very handy. However, if a specific JDK is required, replacing OSX's default can be problematic.



Epic Workarounds Solutions!

Permissions

Since each CassTor user has a Cassandra node running under their local account, users couldn't be expected to have root / administrator access, so the data directories were placed on the user's desktop.


System configurations

UNIX-based systems don't require any environment variables, so there wasn't much of a problem there. On Windows, the CASSANDRA_HOME variable is added by a batch file (DOS script) using the setx command in two steps: first an attempt is made to set a system-level environment variable, and if that fails, a user-level environment variable is added instead:
C:> setx -m CASSANDRA_HOME "C:\Users\%username%\Desktop\cassandra"
C:> setx CASSANDRA_HOME "C:\Users\%username%\Desktop\cassandra"
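
The same two-step fallback could also be driven from Java rather than a batch file. This is only a rough sketch: it assumes setx exits with 0 on success and non-zero on failure, and it uses the /M switch for the machine-wide variable. CassandraHomeSetter and its runner parameter are hypothetical names, not CassTor code.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.function.ToIntFunction;

final class CassandraHomeSetter {

    /**
     * Tries the system-level variable first and falls back to the
     * user-level one. Returns "system", "user" or "none" depending on
     * which attempt (if any) reported success.
     */
    static String setHome(String home, ToIntFunction<List<String>> runner) {
        List<String> systemLevel = Arrays.asList("setx", "CASSANDRA_HOME", home, "/M");
        if (runner.applyAsInt(systemLevel) == 0) {
            return "system";
        }
        List<String> userLevel = Arrays.asList("setx", "CASSANDRA_HOME", home);
        if (runner.applyAsInt(userLevel) == 0) {
            return "user";
        }
        return "none";
    }

    /** Runs a command and returns its exit code, or -1 if it could not run. */
    static int run(List<String> command) {
        try {
            return new ProcessBuilder(command).inheritIO().start().waitFor();
        } catch (IOException e) {
            return -1;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
    }

    public static void main(String[] args) {
        // Mirrors the Desktop location used in the batch file above.
        String home = System.getProperty("user.home") + "\\Desktop\\cassandra";
        System.out.println("CASSANDRA_HOME set at level: " + setHome(home, CassandraHomeSetter::run));
    }
}
```

Passing the runner in as a parameter keeps the fallback logic testable on machines that don't have setx at all.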


Dependencies

In the Windows branch of the application, a warning message is displayed saying that Cassandra will not start if a Java runtime is unavailable. launch4j was also used to make downloading a JDK simple if one is required.

Bootstrapping through VPN


The CassTor application relies on a VPN, which lets remote machines bootstrap as if they were part of a LAN. Today a problem was encountered with the mechanism that detects which IP should be placed in Cassandra's configuration (for the listen_address option).

The current code first checks whether a VPN connection is available, then looks for an Ethernet connection, and finally, if neither is available, falls back to a Wi-Fi connection.
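
That priority order can be sketched roughly as follows. This is not the CassTor source: the class name and, more importantly, the interface-name prefixes used to guess the connection type are assumptions. Real interface names differ between Ubuntu, OSX and Windows (on OSX, for example, an "en" interface may well be Wi-Fi), so the prefixes would need tuning per platform.

```java
import java.net.Inet4Address;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class ListenAddressPicker {

    // Assumed name prefixes per connection type, highest priority first.
    static final List<List<String>> PRIORITY = Arrays.asList(
            Arrays.asList("tun", "tap", "ppp", "utun"), // VPN
            Arrays.asList("eth", "en"),                 // Ethernet
            Arrays.asList("wlan", "wlp"));              // Wi-Fi

    /** Picks the address of the first interface matching the highest-priority type. */
    static Optional<String> pick(Map<String, String> addressByInterface) {
        for (List<String> prefixes : PRIORITY) {
            for (Map.Entry<String, String> entry : addressByInterface.entrySet()) {
                for (String prefix : prefixes) {
                    if (entry.getKey().startsWith(prefix)) {
                        return Optional.of(entry.getValue());
                    }
                }
            }
        }
        return Optional.empty();
    }

    /** Collects the first IPv4 address of every interface that is up and not loopback. */
    static Map<String, String> scan() throws SocketException {
        Map<String, String> found = new LinkedHashMap<>();
        for (NetworkInterface ni : Collections.list(NetworkInterface.getNetworkInterfaces())) {
            if (!ni.isUp() || ni.isLoopback()) {
                continue;
            }
            for (InetAddress address : Collections.list(ni.getInetAddresses())) {
                if (address instanceof Inet4Address) {
                    found.put(ni.getName(), address.getHostAddress());
                    break;
                }
            }
        }
        return found;
    }

    public static void main(String[] args) throws SocketException {
        System.out.println("listen_address candidate: " + pick(scan()).orElse("none found"));
    }
}
```

Keeping pick separate from scan means the priority logic can be exercised with made-up interface names, without needing a real VPN connection.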



On a side note, some usability testing was carried out. The aim was to work out which of the three connection types was used most frequently. With the help of ten volunteers, it was established that most liked the VPN connection best, as CassTor always managed to bootstrap straight away, while with Wi-Fi and Ethernet connections there were occasional timeouts:



The above pretty much shows that the current approach assumes too much about the network infrastructure that CassTor can handle. One possibility is to switch to a keyspace that uses NetworkTopologyStrategy, but it could be argued that the reliability a VPN provides should be one of the application's prerequisites.

Thanks to all of you who helped out!

APIs and Frameworks


A large combination of APIs and frameworks was used to build CassTor. In this post I'll share my experience of dealing with each and why I decided to use them in the first place.

APIs

Astyanax

A very light API for interfacing with Cassandra. Astyanax is difficult to pronounce and very easy to get started with! There was a huge choice of APIs to use as connectors to Cassandra, and even more have come out since the beginning of this project, the latest being the ODBC driver for Hive from DataStax. I feel that I've got plenty of experience with Cassandra APIs. Some experiences were enjoyable, like using CQL for the first time, and some not so much (I didn't like Hector, but back in 2011 it seemed like a good choice). I chose Astyanax for three main reasons: it was something new (and I quite enjoy learning new things), it had become very popular with the Stack Overflow community when I was starting my project, and it supported both CQL and RPC.

Astyanax has been one of the simplest APIs I've ever used, and I'd like to think that is thanks to Netflix's amazing documentation. My struggle with Hector was primarily because I didn't know how Cassandra worked at the time, but having to look through unit tests to find code snippets for creating keyspaces and the like didn't help.


Silvertunnel TOR

This choice was based on the lack of choice available. There are few libraries for connecting through TOR. Of the two alternatives, netlib and the MIT torlib, one was C++ wrapped in Java and the other used outdated socket implementations (which failed to connect correctly when tested). Silvertunnel is a highly threaded API that redirects the network traffic of a Java application by overriding the Socket class and routing it through the NetLayer provided by the API. Sadly, connecting to TOR is very slow (I really mean slow, it can take anything between 6 and 15 minutes) and doesn't work if a firewall is present, but once a connection is established, query delays are small (a typical query that takes 8ms to execute will jump up to 150ms, a big jump, but still not noticeable by a human unless executing hundreds of sequential queries).


Frameworks

Java SWING

SWING was the prime contender if CassTor was to be implemented as a desktop application. The decision for a desktop application was based on avoiding unnecessary connections to the internet.

Why did I pick SWING? Again, it was something new. I've made a number of JSP/Spring MVC web applications but never a desktop application, and when researching the various frameworks out there, SWING seemed like a really good choice: it allows an application to be styled so that it looks native on OSX and Windows 7, and building a cross-platform application was the ultimate goal.

CassTor internal email frame

Swing is a great framework, but there were a number of issues while using it, most of which were down to my lack of experience with Java GUI design. I struggled to build frames whose contents would adjust sensibly when they were resized. Also, it's great when IDEs try to prevent you from breaking your application, but NetBeans' approach of preventing me from altering GUI-related code was very frustrating, to the point where I considered switching to IntelliJ. It was too late in the project, though, and dealing with potential issues would have cost time I couldn't afford.


To sum up, each library/framework had its challenges, but overall, building the project without them would have been much, much more challenging!

Cassandra Config


Handling growth.

The system depends on its users. Each user’s machine will be a Cassandra node that forms part of the cluster. Because a user’s machine can't be expected to be on all the time, each node will store the full dataset. To achieve this, the replication factor of the keyspace containing the messages will be incremented whenever a new user joins the cluster. Upon start-up, their local Cassandra server will bootstrap to the cluster and become part of the ring. The token used for this operation will be 0, so that Cassandra itself can decide what token should be supplied, and since each node stores 100% of the data, the token will not have a negative impact on the balance of the cluster.
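
In CQL 3 terms, bumping the replication factor of a simple keyspace comes down to a single ALTER KEYSPACE statement. The helper below only builds that statement as a string. It is a sketch rather than CassTor's actual code: the keyspace name casstor_mail and the class name are hypothetical, and how the statement gets executed is left out.

```java
final class ReplicationCql {

    /**
     * Builds the CQL 3 statement that sets a SimpleStrategy keyspace's
     * replication factor to the number of nodes in the cluster, so that
     * every node holds the full dataset.
     */
    static String alterReplication(String keyspace, int nodeCount) {
        if (!keyspace.matches("[A-Za-z][A-Za-z0-9_]*")) {
            throw new IllegalArgumentException("invalid keyspace name: " + keyspace);
        }
        if (nodeCount < 1) {
            throw new IllegalArgumentException("node count must be at least 1");
        }
        return "ALTER KEYSPACE " + keyspace + " WITH replication = "
                + "{'class': 'SimpleStrategy', 'replication_factor': " + nodeCount + "};";
    }

    public static void main(String[] args) {
        // Three seed nodes plus the first user gives four nodes in total.
        System.out.println(alterReplication("casstor_mail", 4));
        // prints: ALTER KEYSPACE casstor_mail WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 4};
    }
}
```

As far as I know, raising the replication factor does not move existing data by itself, so a repair is normally needed afterwards before the extra replicas are actually populated.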


New user joining the Cassandra cluster - use case diagram

Cassandra had to be configured to allow for such behaviour. Firstly, a simple keyspace was used to allow strict control of the replication factor. Next, the rpc_address option was set to 0.0.0.0 so that Cassandra listens on all interfaces for incoming traffic on port 9160. Finally, listen_address was left as <INTERFACE_IP>, a tag that CassTor picks up and replaces with an interface address that it detects on its first run.
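
The tag replacement itself can be as simple as a string substitution over the contents of cassandra.yaml. A minimal sketch, assuming the file has already been read into a string; the class name is made up, and reading and writing the file is left out.

```java
final class ListenAddressTemplate {

    static final String TAG = "<INTERFACE_IP>";

    /**
     * Replaces the placeholder tag with the detected address. If the tag
     * is no longer present (any run after the first), the text comes back
     * unchanged.
     */
    static String fill(String yaml, String address) {
        return yaml.replace(TAG, address);
    }

    public static void main(String[] args) {
        String yaml = "listen_address: <INTERFACE_IP>\nrpc_address: 0.0.0.0\n";
        // 10.0.0.7 is an example address, not one from the real cluster.
        System.out.print(fill(yaml, "10.0.0.7"));
        // prints:
        // listen_address: 10.0.0.7
        // rpc_address: 0.0.0.0
    }
}
```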

Are we there yet?


Upgrading to the latest

There are many new features in Cassandra 1.2.3, and at this stage I'm planning on upgrading from 1.2.0 to the latest version. There aren't too many changes that can affect the upgrade. The prime concern is the switch from Thrift to CQL, but from reading up on this it's clear that Apache have kept the Thrift interface as an option for backwards compatibility. An attempt at switching will be made on Tuesday, and hopefully not too much has changed.

Joining the ring

Cassandra ring after a new user.

The three seed nodes are 134.36.36.188, 134.36.36.189 and 134.36.36.217, running Windows 8, Ubuntu and OSX respectively. The final machine, 134.36.36.116, is an OSX machine representing the first user to join the ring, and is connected to the ring through a VPN connection. The ring will grow with each user that joins, the theory being that the more users there are, the higher the availability of the network.


Balance - it's key

There are a number of potential issues with the current system. The first is keeping track of which hosts are up or down when a new user joins the system. This shouldn't be a big problem because the three seed nodes should be up all the time, and one node going down will hopefully still allow new users to bootstrap successfully.

The next problem is balancing the data with the incrementing replication factor. From the above diagram it's clear that the ring isn't well balanced, and with new users joining the ring it will become even less balanced. Currently the argument for "don't worry, it'll be ok" is that each node holds 100% of the data, so balanced or not, if a node goes down (it's expected for nodes to go down, as people using the system will switch their machines off) the data will still be available.


The great idea for big data


What’s the big idea?

A simple one! Make a cross-platform, secure email system that can grow all by itself.

How can it be done?

To build such an application, a number of open-source technologies are going to be combined, each with its own purpose.

Technology - Purpose
Cassandra - Provides the ability to scale horizontally regardless of the dataset’s size or the number of users in the system.
Java - Cross-platform, high-level language which has access to a wide range of Cassandra APIs.
TOR - Obfuscates the source/destination of the data and provides protection against network surveillance.


An overview

Technology Overview

The client of the email system will be built using Java SWING, allowing for a cross-platform application. The client will connect to Cassandra through the Astyanax API, using TOR to protect the connection from surveillance.


Hello World People!

What?

This blog will be an informal documentation of my progress with CassTor, a project for the next great data developer competition by Datastax.

Why?

Easy question! I love using Cassandra, everybody uses email and security is always a concern. The system will focus on security and the ability to scale without having to invest money in hosting that guarantees 99.99% uptime, because even a percentage that high isn’t good enough when using Cassandra can yield better results.

Where?

I’m not a ninja, so there are a number of places where you can find me and my project: