Interactive-diagrams updates

It has been some time since I’ve written a blog post. Today I would like to present you the latest updates from the interactive-diagrams project.

  • diagrams-ghcjs finally got text support

Big thanks to Joel Burget for the implementation.

The testsuite demo for the package can be found at http://co-dan.github.io/ghcjs/diagrams-ghcjs-tests/Main.jsexe/ as usual.

  • New interactive widgets

We got rid of the old wizard-like widgets in favour of “all-in-one” style widgets. Thanks to Brent and Luite who came up with the type trickery to get this done.

  • New flat design

In order to match your slick & flat iOS7 style I’ve rolled out an update to Bootstrap 3.

  • Ability to paste code with errors. (It’ll ask you to make sure you didn’t make a mistake by accident)
  • Ability to quickly include a number of imports. By checking the “import standard modules” checkbox you’ll automatically bring several useful modules into the scope:
    • Diagrams.Prelude
    • Diagrams.Backend.SVG (or Diagrams.Backend.GHCJS if you are making an interactive widget)
    • Data.Maybe
    • Data.Tuple
    • Data.List
    • Data.Char
  • Other minor UI fixes such as documentation improvement

GSoC 2013, an afterword

The Summer of Code 2013 is over, and here is what I have to say about it.

Introduction

The project is live at http://paste.hskll.org. The source code can be found at http://github.com/co-dan/interactive-diagrams.

I would like to say that I do plan to continue working on the project (and on adjacent projects as well if possible).

Interactive diagrams

Interactive diagrams is a pastebin and a set of libraries used for dynamically compiling, interpreting and rendering the results of user inputted code in a secure environment.

The user inputs some code and the app compiles and renders it. Graphical output alongside with code can be useful for sharing the experiments, teaching beginners an so on. If the users inputs a code that can not be rendered on the server (i.e.: a function), the app produces an HTML5/JS widget that runs the corresponding code.

The produced libraries can be used in 3rd party services/programs.

Screenshot

Screenshot

Technology used

The pastebin is powered by Scotty and scotty-hastache, the access to PosgreSQL db is done via the excellent persistent library. The compilation is done using GHC and GHCJS inside the workers processes powered by the restricted-workers library.

You can read some my previous report on this project which is still pretty relevant.

I plan on updating the documents on the wiki sometime soon.

Progress

The bad news is that I don’t think I was able to 100% complete what I originally envisioned. The good news is that I seem to know, almost exactly, what do I want to improve and how to do that. As I’ve mentioned I plan on continuing with the project and I hope that the project will grow and improve.

One thing that I felt was annoying is the (technical) requirement to use GHC HEAD. Because of that a lot of packages required updates and fixes. Due to changes in GHC and bugfixes in GHCJS I had to perform the tiring and not so productive procedure of updating all the necessary tools, rebuilding everything and so on. But I guess that’s just how computers work and I am sure that in the future (with the release of GHC 7.8 and a new Haskell Platform) the maintenance and installation will be much easier. Another thing that took a lot of my time was configuring the system and setting up the necessary security measures, which was also necessary.

Other stuff that kinda slowed thing down include: the lack of a good build system, in some cases non-American timezone (actually I think that the fact that my mentor, Luite Stegeman, was quite close to me in terms of timezones allowed us to communicated very frequently, as we did), the lack of knowledge of the tools I used (although you can think of it this way: I had an ability to learn exciting new things ;] ).

Among the grand things I plan to do: release a library for manipulating Haskell AST at the GHC level; make an IRC bot using the eval-api and restricted-workers; continue writing my notes/tutorials about GHC API (I have a few drafts laying around).

Some code refactoring should come along and a number of features for the pastebin should be implemented.

Feelings

When the end of the program was approaching I predicted that I would have that sort of conflicted feelings that you usually get when you finish reading a good book – one part of you feels happy because you had an enjoyable experience, yet another part of you doesn’t feel so giddy, because the thing that you enjoyed is over. Well, I didn’t get this with GSoC. I did feel happy, but I didn’t get this touch of sadness. GSoC was a way for me to get into collaborating with people on real-world open source projects, and the end of GSoC for me is a beginning of something else. I can use my experience now to write better code, write more code and write useful code.

Concluding

I had a very exciting summer and I would positively recommend anyone eligible to participate in the Google Summer of Code program. There is, however, a thing to remember. Programmers are known to be the kind of people who set ambitious goals. Reach for something inspiring, ambitious, yet realistic. Make sure to find something in between, that way you’ll have a concrete target that you know that you are able to achieve, but you also have a room for improvement.

PS. Acknowledgments

I would like to thank various people who helped me along the summer: Luite Stegeman, Brent Yorgey, Carter Schonwald, Daniel Bergey, Andrew Farmer; everyone in #diagrams, everyone in the #haskell channel who patiently answered my question; everyone on GitHub who responded to my comments, questions and pull requests. The Haskell community is lucky to have a huge amount of friendly and smart people.

WIP: GHCJS backend for the diagrams library

About

I’ve picked up the development of the diagrams-ghcjs backend for the infamous diagrams library. This backend is what we use for the interactive-diagrams pastebin and it renders diagrams on an HTML5 Canvas using the canvas bindings for ghcjs. The diagrams-ghcjs is a fork of (unmaintained?) diagrams-canvas backend by Andy Gill and Ryan Yates.

The library is still WIP and it requires bleeding edge versions of ghcjs, ghcjs-canvas and ghcjs’ shims to function.

The library is scheduled to be released together with ghcjs.

Demo

The current demo can be found at http://co-dan.github.io/ghcjs/diagrams-ghcjs-tests/Main.jsexe/. It is supposed to work in both Firefox and Chrome.

Problems

  • Text is not implemented. Some work is being done in the text-support branch. Generally, it has been proven hard to implement good text support, even diagrams-svg backend text support is limited;
  • Firefox still does not support line dashing via ‘setLineDash’ and ‘lineDashOffset’. As a result we need to have additional shims.

ANN: restricted-workers-0.1.0

Introducing: restricted-workers library, version 0.1.0.

This library provides an abstract interface for running various kinds of workers under resource restrictions. It is being developed as part of the interactive-diagrams project and you can read more about the origins of the library in my GSoC report: http://parenz.wordpress.com/2013/07/15/interactive-diagrams-gsoc-progress-report/

The library provides a convenient way of running worker processes, saving data obtained by the workers at start-up, a simple pool abstraction and a configurable security and resource limitations.

Right now there are several kinds of security restrictions that could be applied to the worker process:

  • RLimits
  • chroot jail
  • custom process euid
  • cgroups
  • process niceness
  • SELinux security context

You can read more about the library on the wiki: https://github.com/co-dan/interactive-diagrams/wiki/Restricted-Workers

The library has been uploaded to hackage and you can install it using cabal-install.

Pastebin update

I have updated the pastebin design and added some useful features.

Among with some minor tweaks the main changes are:

  • Author & title field added
  • Slick bootstrap design including buttons, pills and other web two oh stuff
  • Gallery of random images from the pastebin database
  • Two modes for viewing a paste: view mode and edit mode (edit mode still lacks a sophisticated JS editor)
  • Code highlighting in the view mode
  • Installed all the Acme packages on the server

Check the new website out: http://paste.hskll.org

If you have any suggestions regarding the design or the functionality of the web site please don’t hesitate to mail me or leave a comment.

View mode

View mode

New paste

New paste

Interactive-diagrams GSoC progress report

Intro

As some of you may already know, I’ve published the first demo version of the interactive-diagrams online, it can be found at http://paste.hskll.org (thanks to my mentor Luite Stegeman for hosting). It’s not very interactive yet, but it’s a good start. At the same time it took me a while to get everything up and running so in this blog post I would like to describe and discuss the overall structure and design of the project along with some details about the vast number of security restrictions that are being used.

Please note that http://paste.hskll.org is just a demo version and I can guarantee neither the safety of your pastes nor the uptime of the app. The ‘release notes’ can be found here.

If you have any suggestions or bug reports don’t hesitate to mail me (difrumin аt gmail dоt com) or use the bugtracker.

System requirements

GNU/Linux operating system, GHC 7.7 (I think it’s possible to make the whole thing work with GHC 7.6 but I don’t have time to support it and test everything), lots of RAM. In order to use some security restrictions you will also need SELinux and cgroup.

High-level structure

The whole program consists of three main components (it would be better to say three main types of components since there are usually multiple workers in the system):

  • The web app (sources can be found in scotty-pastebin), powered by WAI, Scotty and Blaze;
  • The service app (eval-api/src-service);
  • Workers (eval-api/src).

The web server handles user requests, database logic, and renders the results. Workers are the processes that perform the actual evaluation. The service component is the one that handles the pool of workers, keeps track of how many workers are available and forks new workers if necessary. The web app does not communicate with workers without the permission of the service.

All the communication between the components is performed with the help of UNIX sockets.

Request example

Here’s an example workflow in the system:

  1. User connects to the web server, sends the request to evaluate a certain bit of code.
  2. Web server talks to the service, requesting a worker.
  3. Server reuses an existing worker if an idle one exists. Otherwise it forks a new one or blocks if the limit is reached.
  4. Worker, upon starting, loads the necessary libraries and applies security restrictions.
  5. The web server receives a worker and sends it a request for evaluation.
  6. The server waits, if there is no reply from the worker after a certain amount of time, it sends a message to the service saying that the worker timed out. If the web service receives the reply, it stores the result in the database and continues with the user request.
  7. When the service receives message about one if its workers it decides whether to kill/restart it or not. If the worker’s process has timed out or results in an error (eg: out of memory exception) then the service restarts it.

Component permissions

Setting up the right permissions for the components is a crucial part in creating a secure environment. Depending on what security restrictions you have enabled you might want to choose different permissions for the processes. On http://paste.hskll.org we use the full set of security restrictions and limits (see the next section) and it requires us to give the components certain permissions.

  • scotty-pastebin is run as a user in a multithreaded runtime;
  • eval-service is run as a superuser (required for setting up chrooted jails) in a single-threaded environment (required due to forking/SELinux restrictions, see the SELinux section for details), listens on the ‘control’ socket;
  • workers are forked from eval-service as root, but they change their processes’ uid as soon as possible, listens on the ‘workerN’ socket (opened prior to chroot’ing);

Additionally the whole thing runs in a VM.

See also this wiki page written by luite

Security limitations and restrictions

Interactive-digrams applies a whole lot of limitations to the worker processes, which can be configured using the following datatype:

data LimitSettings = LimitSettings
    { -- | Maximum time for which the code is allowed to run
      -- (in seconds)
      timeout     :: Int
      -- | Process priority for the 'nice' syscall.
      -- -20 is the highest, 20 is the lowest
    , niceness    :: Int
      -- | Resource limits for the 'setrlimit' syscall
    , rlimits     :: Maybe RLimits
      -- | The directory that the evaluator process will be 'chroot'ed
      -- into. Please note that if chroot is applied, all the pathes
      -- in 'EvalSettings' will be calculated relatively to this
      -- value.
    , chrootPath  :: Maybe FilePath
      -- | The UID that will be set after the call to chroot.
    , processUid  :: Maybe UserID
      -- | SELinux security context under which the worker 
      -- process will be running.
    , secontext   :: Maybe SecurityContext
      -- | A filepath to the 'tasks' file for the desired cgroup.
      -- 
      -- For example, if I have mounted the @cpu@ controller at
      -- @/cgroups/cpu/@ and I want the evaluator to be running in the
      -- cgroup 'idiaworkers' then the 'cgroupPath' would be
      -- @/cgroups/cpu/idiaworkers@
    , cgroupPath  :: Maybe FilePath
    } deriving (Eq, Show, Generic)

There is also a Default instance for LimitSettings and RLimits with most of the restrictions turned off:

defaultLimits :: LimitSettings
defaultLimits = LimitSettings
    { timeout    = 3
    , niceness   = 10
    , rlimits    = Nothing
    , chrootPath = Nothing
    , processUid = Nothing
    , secontext  = Nothing 
    , cgroupPath = Nothing
    }

Below I’ll briefly describe each limitation/restriction with some details.

Timeout & niceness

The timeout field specifies how much time (in seconds) the server waits for the worker. (Note: this is the only limitation that is controlled on the side of the web server. The corresponding procedure is processTimeout. We really want this to be run in the multithreaded environment)

Niceness is merely the value passed to the nice() syscall, nothing special.

rlimits

The resource limits are controlled by syscalls to setrlimit. The limits itself are defined in the RLimits datatype:

data RLimits = RLimits
    { coreFileSizeLimit :: ResourceLimits
    , cpuTimeLimit      :: ResourceLimits
    , dataSizeLimit     :: ResourceLimits
    , fileSizeLimit     :: ResourceLimits
    , openFilesLimit    :: ResourceLimits
    , stackSizeLimit    :: ResourceLimits
    , totalMemoryLimit  :: ResourceLimits
    } deriving (Eq, Show, Generic)

ResourceLimits itself is defined in System.Posix.Resource. For more information on resource limits see setrlimit(3).

Chrooted jail

In order to restrict the worker process we run it inside the chroot jail. The easiest way to create a fully working jail is to use debootstrap. It’s also necessary to install gcc and GHC libraries inside the jail.

mkdir
sudo debootstrap wheezy /idia/run/workers/worker1
sudo chmod  /idia/run/workers/worker1
cd /idia/run/workers/worker1
sudo mkdir -p ./home/
sudo chown  ./home/
cd ./home/
mkdir .ghc && sudo mount --bind ~/.ghc .ghc
mkdir .cabal && sudo mount --bind ~/.cabal .cabal
mkdir ghc && sudo mount --bind ~/ghc ghc # ghc libs
cd ../..
cp ~/interactive-diagrams/common/Helper.hs .
sudo chroot .
apt-get install gcc # inside the chroot

I tried installing Emdebian using multistrap to reduce the size of the jail, but GHC won’t run properly in that environment, complaining about librt.so (which was present in the system), so I decided to stick with debootstrap. If anyone knows how to avoid this problem with multistrap please mail me or leave a comment.

Process uid

This is the uid the worker process will run under. The socket file will also be created by the user with this uid.

SELinux

SELinux (Security-enhanced Linux) is a Linux kernel module providing mechanisms for enforcing fine-grained mandatory access control, brought to you by the creators of infamous PRISM!

SELinux allows the system administrator to control the security of the system by specifying in (modular) policy files the AVC (access vector cache) rules. The SELinux kernel module sits there and monitors all the syscalls and, if it finds something that is not explicitly allowed in the policy, it blocks it. (Well, actually something a little bit different is going on, but for the sake of simplicity I am leaving it out)

Everything on your system – files, network sockets, file handles, processes, directories – is labelled with a SELinux security context, which consists of a role, a user name (not related to the regular system user name) and a domain (also called type in some literature). In the policy file you specific which domains are allowed to perform various actions on other domains. A typical piece of the policy file will look like this:

allow myprocess_t self:udp_socket { create connect };
allow myprocess_t bin_t:file { execute };

The first line states that the process from the domain myprocess_t is allowed to create and connect to the UDP sockets of the same domain. The second line allows a process in that domain to execute files of type bin_t (usually files in /bin/ and /usr/bin).

Note: the secontext field actually contains only the
security domain. When the worker process is changing the security
context it uses the same user/resource as it originally had.

In our SELinux policy we have several domains:

  • idia_web_t – the domain under which the scotty-pastebin run
  • idia_web_exec_t – the domain of the scotty-pastebin executable and other files associated with that binary
  • idia_service_t – the domain under which the eval-service run
  • idia_service_exec_t – the domain of the eval-service executable and other files associated with that binary
  • idia_service_sock_t – UNIX socket files used for communication
  • idia_db_t, idia_pkg_t, idia_web_common_t – database files, packages, html files, templates and other stuff
  • idia_worker_env_t – chroot’ed environment in which the worker operates
  • idia_restricted_t – the most restricted domain in which the workers run and evaluate code

The reason we made the service program run in a single-threaded environment is the following: if we run it in the multi-threaded environment (like we wanted) the worker processes want to have access to file descriptors, inherited from the idia_service_t, which, of course, is dangerous and should not be allowed.

I personally don’t enjoy using SELinux very much. It’s very hard to configure it and among its shortcomings I can list the fact that there is no distinction between file types and process types; there is no proper separation, even when using the modular policy, as the duplicated types are checked when you load the module and there is no way (that I know of) to easily introduce a fresh unused type. And there is this thing that puzzled me for quite a while: home directories are treated specially. Even if you configure a subdirectory in your home dir to have a specific security context, restorecon won’t correctly install the context specified in the policy. You actually have to set he context yourself, using chcon.

Cgroups

Cgroups is the system that can be used to aid the way Linux schedules CPU time/shares, distributes memory to the processes. It does so by organizing processes into hierarchical groups with configured behaviour.

Installing cgroups on debian is somewhat tricky, because the package is a little bit weird.

sudo apt-get install cgroup-bin libcgroup1
sudo cgconfigparser -l ~/interactive-diagrams/cgconfig.conf

For our purposes we have a cgroup called idiaworkers. We also mount the cpu controller on /cgroups/cpu:

$> ls -l /cgroups/cpu/
total 0
-rw-r--r--. 1 root root 0 Jul 12 16:22 cgroup.clone_children
--w--w--w-. 1 root root 0 Jul 12 16:22 cgroup.event_control
-rw-r--r--. 1 root root 0 Jul 12 16:22 cgroup.procs
-rw-r--r--. 1 root root 0 Jul 12 16:22 cpu.shares
drwxr-xr-x. 2 root root 0 Jul 12 16:22 idiaworkers
-rw-r--r--. 1 root root 0 Jul 12 16:22 notify_on_release
-rw-r--r--. 1 root root 0 Jul 12 16:22 release_agent
-rw-r--r--. 1 root root 0 Jul 12 16:22 tasks
$> ls -l /cgroups/cpu/idiaworkers
total 0
-rw-r--r--. 1 root    root 0 Jul 12 16:22 cgroup.clone_children
--w--w--w-. 1 root    root 0 Jul 12 16:22 cgroup.event_control
-rw-r--r--. 1 root    root 0 Jul 12 16:22 cgroup.procs
-rw-r--r--. 1 root    root 0 Jul 12 16:22 cpu.shares
-rw-r--r--. 1 root    root 0 Jul 12 16:22 notify_on_release
-rw-r--r--. 1 vagrant root 0 Jul 14 06:21 tasks

In order to modify how much CPU time our group gets, we write to the cpu.shares file: sudo echo 100 > /cgroups/cpu/idiaworkers/cpu.shares. If we want to add the task/process to the group we simply append the tasks file: echo $PID >> /cgroups/cpu/idiaworkers/tasks. The workers append themselves to the task file automatically (if the cgroup restrictions are enabled in the LimitSettings).

Open problems/requests

I am still not sure how do I write tests for this project. Do I write tests for my GHC API wrappers? Do I write tests for my workers pool? I probably should take a look how similar projects handles those.

Outro

So, as you can see, we have something working here and now that we manage to take the initial steps it will be much easier for us to push changes and make them available for public to use and comment on. There is still a long way to come. The code needs some serious cleanup (we’ve switched the design model a couple of weeks ago, which affected the internal structure seriously), the documentation needs to be written. And of course new features are waiting to be implemented :) We will be supporting multiple UIDs for workers and looking into using LXC for simplifying the setup process too.

I would like to thank augur and luite for their editorial feedback.

Stay tuned for the next posts about configuring the program for evaluation settings and reusing the components from the library.

The Protocol Problem

In the interactive-diagrams project we have a bunch of components (processes) running independently and even under different UIDs. They however need to communicate with each other, and we’ve decided to go full UNIX-way (since already most of our code depend on POSIX-compatibility) and use IPC/UNIX-sockets for communication.

Originally, I’ve implemented the following protocol for sending the
data around:

When sending data:

  1. Encode the data using the functions from the cereal package.
  2. Take the ‘length’ of the resulting bytestring, send it over the
    socket.
  3. Send the encoded data on the next line.

Upon receiving data:

  1. Read the first line, deserialize it to an x :: Int
  2. Read the the next x bytes, deserialize the data.

A programmer experienced in the area would have probably already
spotted an error in this approach, but for me it took some time to
find a bug, once I’ve realised that I was getting deserialization
errors from time to time.

The problem of course is that I was relying on reading lines from the
socket, not bytes. For example, the number 2623 on my 64bit system has
serialized to something with the newline character in it

Q6I3.png

The solution to this problem is to read the first ‘8’ (or ‘4’) bytes
to get the length of the upcoming data. To make sure that the number
of bytes representing the length of the data is constant on all the
platforms I’ve switched to using Word32.

This approach too, of course, is not prone to errors. If you are
sending and receiving a lot of data consider using a streaming
library.

PS: We’ve planned originally to release the first public alpha-version yesterday, but it turned out that it is actually taking more time configuring SELinux, chroots and other security measure than expected. I’ve also changed the general structure of the project, I am using a different design model than I had a week ago. The new changes would make the application more salable and the resulting library can (and most likely will) be reused for other stuff. Stay tuned!

Summer of Code

Hello, everyone!

I’ve decided to reinstate this blog since I’ve got accepted to this year’s Google Summer of Code program. I’ll blog about my updates, stuff that I’ve been working on and bottlenecks and problems I’ve encountered.

My project is a pastebin site using diagrams and GHCJS to generate embeddable interactive widgets and static images/text in case when the pasted code does not require additional interaction. My mentor is Luite Stegeman, and Brent Yorgey and other nice people from the diagrams community has agreed to help.

I am very excited about this and happy that I’ve got a whole bunch of smart people to help me with this.

Unfortunately, as we haven’t sorted out a completely safe way to evaluate code coming from 3rd parties, there is no public version hosted anywhere yet. Meanwhile, there is a project on GitHub.

Hopefully, soon I’ll be able to publish a post about my experience with bootstrapping GHCJS.
Until then, stay tuned!