GSoC 2013, an afterword

The Summer of Code 2013 is over, and here is what I have to say about it.

Introduction

The project is live at http://paste.hskll.org. The source code can be found at http://github.com/co-dan/interactive-diagrams.

I would like to say that I do plan to continue working on the project (and on adjacent projects as well if possible).

Interactive diagrams

Interactive diagrams is a pastebin and a set of libraries for dynamically compiling, interpreting, and rendering the results of user-submitted code in a secure environment.

The user inputs some code and the app compiles and renders it. Graphical output alongside the code can be useful for sharing experiments, teaching beginners and so on. If the user inputs code that cannot be rendered on the server (e.g. a function), the app produces an HTML5/JS widget that runs the corresponding code.

The resulting libraries can be used in third-party services and programs.

Screenshot

[screenshot of the pastebin]

Technology used

The pastebin is powered by Scotty and scotty-hastache, and access to the PostgreSQL database is done via the excellent persistent library. The compilation is done using GHC and GHCJS inside the worker processes powered by the restricted-workers library.

You can read my previous report on this project, which is still pretty relevant.

I plan on updating the documents on the wiki sometime soon.

Progress

The bad news is that I don’t think I was able to 100% complete what I originally envisioned. The good news is that I seem to know, almost exactly, what I want to improve and how to do it. As I’ve mentioned, I plan on continuing with the project and I hope that it will grow and improve.

One thing that I felt was annoying is the (technical) requirement to use GHC HEAD. Because of that, a lot of packages required updates and fixes. Due to changes in GHC and bugfixes in GHCJS I had to perform the tiring and not-so-productive procedure of updating all the necessary tools, rebuilding everything and so on. But I guess that’s just how computers work, and I am sure that in the future (with the release of GHC 7.8 and a new Haskell Platform) maintenance and installation will be much easier. Another thing that took a lot of my time was configuring the system and setting up the necessary security measures.

Other things that slowed the work down include: the lack of a good build system; in some cases a non-American timezone (although the fact that my mentor, Luite Stegeman, was quite close to me in terms of timezones allowed us to communicate very frequently, as we did); and my lack of knowledge of the tools I used (although you can also think of it this way: I had an opportunity to learn exciting new things ;] ).

Among the grand things I plan to do: release a library for manipulating the Haskell AST at the GHC level; make an IRC bot using eval-api and restricted-workers; continue writing my notes/tutorials about the GHC API (I have a few drafts lying around).

Some code refactoring should come along and a number of features for the pastebin should be implemented.

Feelings

When the end of the program was approaching, I predicted that I would have the sort of conflicted feelings you usually get when you finish reading a good book – one part of you feels happy because you had an enjoyable experience, yet another part of you doesn’t feel so giddy, because the thing you enjoyed is over. Well, I didn’t get this with GSoC. I did feel happy, but I didn’t get the touch of sadness. GSoC was a way for me to get into collaborating with people on real-world open source projects, and the end of GSoC is, for me, the beginning of something else. I can use my experience now to write better code, write more code and write useful code.

Concluding

I had a very exciting summer and would positively recommend the Google Summer of Code program to anyone eligible. There is, however, a thing to remember. Programmers are known to be the kind of people who set ambitious goals. Reach for something inspiring and ambitious, yet realistic, and find something in between: that way you’ll have a concrete target that you know you are able to achieve, but you’ll also have room for improvement.

PS. Acknowledgments

I would like to thank the various people who helped me over the summer: Luite Stegeman, Brent Yorgey, Carter Schonwald, Daniel Bergey, Andrew Farmer; everyone in #diagrams; everyone in the #haskell channel who patiently answered my questions; and everyone on GitHub who responded to my comments, questions and pull requests. The Haskell community is lucky to have a huge number of friendly and smart people.

WIP: GHCJS backend for the diagrams library

About

I’ve picked up the development of the diagrams-ghcjs backend for the well-known diagrams library. This backend is what we use for the interactive-diagrams pastebin; it renders diagrams on an HTML5 Canvas using the canvas bindings for ghcjs. diagrams-ghcjs is a fork of the (unmaintained?) diagrams-canvas backend by Andy Gill and Ryan Yates.

The library is still a work in progress and requires bleeding-edge versions of ghcjs, ghcjs-canvas and the ghcjs shims to function.

The library is scheduled to be released together with ghcjs.

Demo

The current demo can be found at http://co-dan.github.io/ghcjs/diagrams-ghcjs-tests/Main.jsexe/. It is supposed to work in both Firefox and Chrome.

Problems

  • Text is not implemented. Some work is being done in the text-support branch. Generally, it has proven hard to implement good text support; even the diagrams-svg backend’s text support is limited;
  • Firefox still does not support line dashing via ‘setLineDash’ and ‘lineDashOffset’. As a result we need additional shims.

GHC API: Interpreted, compiled and package modules

The third post in the series.

Intro

It’s hard to get into writing code that uses the GHC API. The API itself is huge, and the number of functions and options significantly outnumbers the amount of tutorials around.

In this series of blog posts I’ll elaborate on some of the peculiar, interesting problems I’ve encountered during my experience writing code that uses GHC API and also provide various tips I find useful.

I have built for myself a small layer of helper functions that helped me with using GHC API for the interactive-diagrams project. The source can be found on GitHub and I plan on refactoring the code and releasing it separately.

Today I would like to talk about the different ways of bringing the contents of Haskell modules into scope, a process that is necessary for evaluating/interpreting bits of code on the fly.

Many of the points I make in this post are actually trivial, but nevertheless I made all of the mistakes mentioned here, perhaps due to my naive approach of quickly diving in and experimenting instead of reading the documentation and source code. Now I realize that this post should have been the first in the series, since it deals with more basic (and fundamental) stuff than the previous two posts. But anyway, here it is.

Interpreted modules

Imagine the following situation: we have a Haskell source file with code we want to load dynamically and evaluate. That is a basic task in GHC API terms, but nevertheless there are some caveats. We start with the basics.

Let us have a file ‘test.hs’ containing the code we want to access:

module Test (test) where
test :: Int
test = 123

The basic way to get the ‘test’ data would be to load ‘Test’ as an interpreted module:

import Control.Applicative
import DynFlags
import GHC
import GHC.Paths
import GhcMonad            (liftIO) -- from ghc7.7 and up you can use the usual
    -- liftIO from Control.Monad.IO.Class
import Unsafe.Coerce

main = defaultErrorHandler defaultFatalMessager defaultFlushOut $ do
    runGhc (Just libdir) $ do
        -- we have to call 'setSessionDynFlags' before doing
        -- everything else
        dflags <- getSessionDynFlags
        -- If we want to make GHC interpret our code on the fly, we
        -- ought to set these two flags, otherwise we wouldn't be
        -- able to use 'setContext' below
        setSessionDynFlags $ dflags { hscTarget = HscInterpreted
                                    , ghcLink   = LinkInMemory
                                    }
        setTargets =<< sequence [guessTarget "test.hs" Nothing]
        load LoadAllTargets
        -- Bringing the module into the context
        setContext [IIModule $ mkModuleName "Test"]

        -- evaluating and running an action
        act <- unsafeCoerce <$> compileExpr "print test"           
        liftIO act

The reason we have to use HscInterpreted and LinkInMemory is that otherwise GHC would compile test.hs in the current directory, leaving behind test.hi and test.o files, which we would not be able to load in interpreted mode; setContext would try to bring in the code from those files first when looking for the module ‘Test’.

dan@aquabox
[0] % ghc --make target.hs -package ghc
[1 of 1] Compiling Main             ( target.hs, target.o )
Linking target ...

dan@aquabox
[0] % ./target
123

Let’s try something fancier, like printing a list of integers one by one.

main = defaultErrorHandler defaultFatalMessager defaultFlushOut $ do
    runGhc (Just libdir) $ do
        dflags <- getSessionDynFlags
        setSessionDynFlags $ dflags { hscTarget = HscInterpreted
                                    , ghcLink   = LinkInMemory
                                    }
        setTargets =<< sequence [guessTarget "test.hs" Nothing]
        load LoadAllTargets
        -- Bringing the module into the context
        setContext [IIModule $ mkModuleName "Test"]

        -- evaluating and running an action
        act <- unsafeCoerce <$> compileExpr "forM_ [1,2,test] print"
        liftIO act

But when we try to run it..

dan@aquabox
[0] % ./target
target: panic! (the 'impossible' happened)
  (GHC version 7.6.3 for x86_64-apple-darwin):
        Not in scope: `forM_'

Please report this as a GHC bug:

http://www.haskell.org/ghc/reportabug

Hm, it looks like we need to bring Control.Monad into scope.

This brings us to the next section.

Package modules

Naively, we might want to load Control.Monad in a similar fashion to how we loaded test.hs:

main = defaultErrorHandler defaultFatalMessager defaultFlushOut $ do
    runGhc (Just libdir) $ do
        dflags <- getSessionDynFlags
        setSessionDynFlags $ dflags { hscTarget = HscInterpreted
                                    , ghcLink   = LinkInMemory
                                    }
        setTargets =<< sequence [ guessTarget "test.hs" Nothing
                                , guessTarget "Control.Monad" Nothing]
        load LoadAllTargets
        -- Bringing the module into the context
        setContext [IIModule $ mkModuleName "Test"]

        -- evaluating and running an action
        act <- unsafeCoerce <$> compileExpr "forM_ [1,2,test] print"
        liftIO act

Our attempt fails:

dan@aquabox
[0] % ./target
target: panic! (the 'impossible' happened)
  (GHC version 7.6.3 for x86_64-apple-darwin):
        module `Control.Monad' is a package module

Please report this as a GHC bug:

http://www.haskell.org/ghc/reportabug

Huh, what? I thought guessTarget worked on all kinds of modules.

Well, it does. But it doesn’t “load the module”; it merely sets it as a target for compilation. Basically it (together with load LoadAllTargets) does what ghc --make does. And surely it doesn’t make much sense to ghc --make Control.Monad when Control.Monad is a module from the base package. What we need to do instead is to bring the compiled Control.Monad module into scope. Luckily, that’s not very hard to do with the help of simpleImportDecl :: ModuleName -> ImportDecl name:

main = defaultErrorHandler defaultFatalMessager defaultFlushOut $ do
    runGhc (Just libdir) $ do
        dflags <- getSessionDynFlags
        setSessionDynFlags $ dflags { hscTarget = HscInterpreted
                                    , ghcLink   = LinkInMemory
                                    }
        setTargets =<< sequence [ guessTarget "test.hs" Nothing ]
        load LoadAllTargets
        -- Bringing the module into the context
        setContext [ IIModule . mkModuleName $ "Test"
                   , IIDecl
                     . simpleImportDecl
                     . mkModuleName $ "Control.Monad" ]

        -- evaluating and running an action
        act <- unsafeCoerce <$> compileExpr "forM_ [1,2,test] print"
        liftIO act

And we can run it

dan@aquabox
[0] % ./target
1
2
123

Compiled modules

What we have implemented so far corresponds to GHCi’s :load *M form of the command, which loads the module in interpreted form and gives us full access to its internals, not just its exports. To illustrate this, let’s modify our test file:

module Test (test) where

test :: Int
test = 123

test2 :: String
test2 = "Hi"

Now, if we try to load that file as an interpreted module and evaluate test2, nothing will stop us from doing so:

dan@aquabox
[0] % ./target-interp
(123,"Hi")

To use the compiled module, we have to bring Test into the context the same way we dealt with Control.Monad:

main = defaultErrorHandler defaultFatalMessager defaultFlushOut $ do
    runGhc (Just libdir) $ do
        dflags <- getSessionDynFlags
        setSessionDynFlags $ dflags { hscTarget = HscInterpreted
                                    , ghcLink   = LinkInMemory
                                    }
        setTargets =<< sequence [ guessTarget "Test" Nothing ]
        load LoadAllTargets
        -- Bringing the modules into the context
        setContext [ IIDecl $ simpleImportDecl (mkModuleName "Test")
                   , IIDecl $ simpleImportDecl (mkModuleName "Prelude")
                   ]
        printExpr "test"
        printExpr "test2"

printExpr :: String -> Ghc ()
printExpr expr = do
    liftIO $ putStrLn ("-- Going to print " ++ expr)
    act <- unsafeCoerce <$> compileExpr ("print (" ++ expr ++ ")")
    liftIO act

Output:

dan@aquabox : ~/snippets/ghcapi
[0] % ./target
-- Going to print test
123
-- Going to print test2
target: panic! (the 'impossible' happened)
  (GHC version 7.6.3 for x86_64-apple-darwin):
	Not in scope: `test2'
Perhaps you meant `test' (imported from Test)

Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug

Note: I had to bring the Prelude into context this time, like a regular module. I tried setting the ideclImplicit option in ImportDecl, but it didn’t work for some reason. Maybe it is actually supposed to do something other than what I think it does.
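
For reference, the failed attempt looked roughly like this (a sketch: ideclImplicit is a real field of ImportDecl, but, as said, setting it did not have the effect I expected):

    -- marking the Prelude import as implicit instead of importing it
    -- explicitly; this compiled, but Prelude names were still not in
    -- scope for 'compileExpr'
    setContext [ IIDecl $ simpleImportDecl (mkModuleName "Test")
               , IIDecl $ (simpleImportDecl (mkModuleName "Prelude"))
                            { ideclImplicit = True }
               ]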

Outro

So that is it: we have managed to dynamically load Haskell source code and evaluate it. I can only refer you to the GHC haddocks for the specific functions we used in this post; most of them take way more options than we used here, and those might prove useful to you.

ANN: restricted-workers-0.1.0

Introducing: restricted-workers library, version 0.1.0.

This library provides an abstract interface for running various kinds of workers under resource restrictions. It is being developed as part of the interactive-diagrams project and you can read more about the origins of the library in my GSoC report: http://parenz.wordpress.com/2013/07/15/interactive-diagrams-gsoc-progress-report/

The library provides a convenient way of running worker processes, saving data obtained by the workers at start-up, a simple pool abstraction, and configurable security and resource limitations.

Right now there are several kinds of security restrictions that can be applied to the worker process:

  • RLimits
  • chroot jail
  • custom process euid
  • cgroups
  • process niceness
  • SELinux security context

You can read more about the library on the wiki: https://github.com/co-dan/interactive-diagrams/wiki/Restricted-Workers

The library has been uploaded to hackage and you can install it using cabal-install.

Adding a package database to the GHC API session

The second post in the series.

Intro

It’s hard to get into writing code that uses the GHC API: it’s huge, there are many options around, and not a lot of introduction-level tutorials.

In this series of blog posts I’ll elaborate on some of the peculiar, interesting problems I’ve encountered during my experience writing code that uses GHC API and also provide various tips I find useful.

I have built for myself a small layer of helper functions that helped me with using GHC API for the interactive-diagrams project. The source can be found on GitHub and I plan on refactoring the code and releasing it separately.

One particular thing I had to do was to add a GHC package database to the GHC API session.

For those familiar with the structure of the interactive-diagrams project: since the workers run in a separate environment, each of them has its own chroot jail, including its own package database. I had to manually set up the path to the package database for each worker so it would pick up the necessary packages.

Package databases

A package database is a directory where the information about your installed packages is stored. For each package registered in the database there is a .conf file with the package details. The .conf file contains the package description (just like in the .cabal file) as well as the paths to binaries and a list of resolved dependencies:

$ cat aeson-0.6.1.0.1-5a107a6c6642055d7d5f98c65284796a.conf
name: aeson
version: 0.6.1.0.1
id: aeson-0.6.1.0.1-5a107a6c6642055d7d5f98c65284796a

import-dirs: /home/dan/.cabal/lib/aeson-0.6.1.0.1/ghc-7.7.20130722
library-dirs: /home/dan/.cabal/lib/aeson-0.6.1.0.1/ghc-7.7.20130722

depends: attoparsec-0.10.4.0-acffb7126aca47a107cf7722d75f1f5e
         base-4.7.0.0-b67b4d8660168c197a2f385a9347434d
         blaze-builder-0.3.1.1-9fd49ac1608ca25e284a8ac6908d5148
         bytestring-0.10.3.0-66e3f5813c3dc8ef9647156d1743f0ef

You can use ghc-pkg to manage installed packages on your system. For example, to list all the packages you’ve installed run ghc-pkg list. To list all the package databases that are automatically picked up by ghc-pkg do the following:

$ ghc-pkg nonexistentpkg
/home/dan/ghc/lib/ghc-7.7.20130722/package.conf.d
/home/dan/.ghc/i386-linux-7.7.20130722/package.conf.d

See ghc-pkg --help or the online documentation for more details.

Adding a package db

By default GHC knows only about two package databases: the global package database (usually /usr/lib/ghc-something/ on Linux) and the user-specific database (under ~/.ghc/). In order to pick up a package that resides in a different package database you have to employ some tricks.

For some reason the GHC API does not export a clear and easy-to-use function that would allow you to do that, although the code we need is present in the GHC sources.

The way this whole thing works is the following:

  1. GHC calls initPackages, which reads the database files and sets up the internal table of package information.
  2. The reading of package databases is performed by the readPackageConfigs function. It reads the user package database, the global package database, the GHC_PACKAGE_PATH environment variable, and applies the extraPkgConfs function, which is a dynflag with the following type: extraPkgConfs :: [PkgConfRef] -> [PkgConfRef] (PkgConfRef is a type representing a package database). The extraPkgConfs flag is what backs the -package-db command-line option.
  3. Once the databases are parsed, the loaded packages are stored in the pkgDatabase dynflag, which is a list of PackageConfigs.

So, in order to add a package database to the current session we simply have to modify the extraPkgConfs dynflag. Actually, there is already a function in the GHC source that does exactly what we need: addPkgConfRef :: PkgConfRef -> DynP (). Unfortunately it’s not exported, so we can’t use it in our own code. I rolled my own functions that I am using in the interactive-diagrams project; feel free to copy them:

-- | Add a package database to the Ghc monad
#if __GLASGOW_HASKELL__ >= 707
addPkgDb :: GhcMonad m => FilePath -> m ()
#else
addPkgDb :: (MonadIO m, GhcMonad m) => FilePath -> m ()
#endif
addPkgDb fp = do
  dfs <- getSessionDynFlags
  let pkg  = PkgConfFile fp
  let dfs' = dfs { extraPkgConfs = (pkg:) . extraPkgConfs dfs }
  setSessionDynFlags dfs'
#if __GLASGOW_HASKELL__ >= 707
  _ <- initPackages dfs'
#else
  _ <- liftIO $ initPackages dfs'
#endif
  return ()

-- | Add a list of package databases to the Ghc monad
-- This should be equivalent to
-- > addPkgDbs ls = mapM_ addPkgDb ls
-- but it is actually faster, because it does the package
-- reinitialization only once, after adding all the databases
#if __GLASGOW_HASKELL__ >= 707
addPkgDbs :: GhcMonad m => [FilePath] -> m ()
#else
addPkgDbs :: (MonadIO m, GhcMonad m) => [FilePath] -> m ()
#endif
addPkgDbs fps = do
  dfs <- getSessionDynFlags
  let pkgs = map PkgConfFile fps
  let dfs' = dfs { extraPkgConfs = (pkgs ++) . extraPkgConfs dfs }
  setSessionDynFlags dfs'
#if __GLASGOW_HASKELL__ >= 707
  _ <- initPackages dfs'
#else
  _ <- liftIO $ initPackages dfs'
#endif
  return ()

See also the Packages module, which contains other functions that modify or make use of extraPkgConfs.

Outro

This was the second post in the series and we have seen how to add a package database to the GHC session. Stay tuned for more brief posts and updates.

On custom error handlers for the GHC API

Intro

It’s hard to get into writing code that uses the GHC API: it’s huge, there are many options around, and not a lot of introduction-level tutorials.

In this series of blog posts I’ll elaborate on some of the peculiar, interesting problems I’ve encountered during my experience writing code that uses GHC API and also provide various tips I find useful.

I have built for myself a small layer of helper functions that helped me with using GHC API for the interactive-diagrams project. The source can be found on GitHub and I plan on refactoring the code and releasing it separately.

Error handling

Today I would like to talk about setting your own error handlers for the GHC API. By default you can expect GHC to spew all the errors onto your screen, but for my purposes I wanted to log them instead.

Naturally, at first I tried to set up custom exception handlers around the GHC API calls myself. I had the following piece of code:

-- Main.hs:
{-# LANGUAGE ScopedTypeVariables #-}
import GHC
import GHC.Paths
import MonadUtils
import Exception
import Panic
import Unsafe.Coerce
import System.IO.Unsafe

-- I thought this code would handle the exception
handleException :: (ExceptionMonad m, MonadIO m)
                   => m a -> m (Either String a)
handleException m =
  ghandle (\(ex :: SomeException) -> return (Left (show ex))) $
  handleGhcException (\ge -> return (Left (showGhcException ge ""))) $
  flip gfinally (liftIO restoreHandlers) $
  m >>= return . Right

-- Initializations, needed if you want to compile code on the fly
initGhc :: Ghc ()
initGhc = do
  dfs <- getSessionDynFlags
  setSessionDynFlags $ dfs { hscTarget = HscInterpreted
                           , ghcLink = LinkInMemory }
  return ()

-- main entry point
main = test >>= print

test :: IO (Either String Int)
test = handleException $ runGhc (Just libdir) $ do
  initGhc
  setTargets =<< sequence [ guessTarget "file1.hs" Nothing ]
  graph <- depanal [] False
  loaded <- load LoadAllTargets
  -- when (failed loaded) $ throw LoadingException
  setContext (map (IIModule . moduleName . ms_mod) graph)
  let expr = "run"
  res <- unsafePerformIO . unsafeCoerce <$> compileExpr expr
  return res

-- file1.hs:
module Main where

main = return ()

run :: IO Int
run = do
  n <- x
  return (n+1)

The problem is that when I run the ‘test’ function above I receive the following output:

h> test

test/file1.hs:4:10: Not in scope: `x'

Left "Cannot add module Main to context: not a home module"
it :: Either String Int

What the ..? My exception handler did catch an error, but:

  1. It caught a strange one
  2. The error I actually intended to catch was printed to the screen instead of being caught

Is there a way to fix this?

Solution

I even asked about this problem on the Haskell-Cafe mailing list, but the folks there don’t seem to be very keen on GHC/the GHC API (which is understandable) and I didn’t get any answers.

But thanks to my mentor Luite Stegeman, we found the solution.

Errors are handled using the LogAction specified in the DynFlags of your GHC session. So to fix this you need to change the log_action field in DynFlags. For example, you can do this:

initGhc = do
  ..
  ref <- liftIO $ newIORef ""
  dfs <- getSessionDynFlags
  setSessionDynFlags $ dfs { hscTarget  = HscInterpreted
                           , ghcLink    = LinkInMemory
                           , log_action = logHandler ref -- the interesting bit
                           }

-- LogAction == DynFlags -> Severity -> SrcSpan -> PprStyle -> MsgDoc -> IO ()
logHandler :: IORef String -> LogAction
logHandler ref dflags severity srcSpan style msg =
  case severity of
     SevError ->  modifyIORef' ref (++ printDoc)
     SevFatal ->  modifyIORef' ref (++ printDoc)
     _        ->  return () -- ignore the rest
  where cntx = initSDocContext dflags style
        locMsg = mkLocMessage severity srcSpan msg
        printDoc = show (runSDoc locMsg cntx)
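
With this in place, you can read back whatever the handler accumulated after a failed compilation. A minimal sketch, assuming the same ref that was passed to logHandler:

import Control.Monad (unless)
import Data.IORef

-- Print the collected error messages, if any
reportErrors :: IORef String -> IO ()
reportErrors ref = do
    errs <- readIORef ref
    unless (null errs) $
        putStrLn ("Collected GHC errors:\n" ++ errs)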

Outro

That’s the first tip and the first post in the series, stay tuned for more updates.

Pastebin update

I have updated the pastebin design and added some useful features.

Along with some minor tweaks, the main changes are:

  • Author & title field added
  • Slick bootstrap design including buttons, pills and other web two oh stuff
  • Gallery of random images from the pastebin database
  • Two modes for viewing a paste: view mode and edit mode (edit mode still lacks a sophisticated JS editor)
  • Code highlighting in the view mode
  • Installed all the Acme packages on the server

Check the new website out: http://paste.hskll.org

If you have any suggestions regarding the design or the functionality of the web site please don’t hesitate to mail me or leave a comment.

View mode

[screenshot: view mode]

New paste

[screenshot: new paste]

Interactive-diagrams GSoC progress report

Intro

As some of you may already know, I’ve published the first demo version of interactive-diagrams online; it can be found at http://paste.hskll.org (thanks to my mentor Luite Stegeman for hosting). It’s not very interactive yet, but it’s a good start. At the same time, it took me a while to get everything up and running, so in this blog post I would like to describe and discuss the overall structure and design of the project, along with some details about the vast number of security restrictions that are being used.

Please note that http://paste.hskll.org is just a demo version and I can guarantee neither the safety of your pastes nor the uptime of the app. The ‘release notes’ can be found here.

If you have any suggestions or bug reports don’t hesitate to mail me (difrumin аt gmail dоt com) or use the bugtracker.

System requirements

A GNU/Linux operating system, GHC 7.7 (I think it’s possible to make the whole thing work with GHC 7.6, but I don’t have time to support and test that), and lots of RAM. In order to use some of the security restrictions you will also need SELinux and cgroups.

High-level structure

The whole program consists of three main components (or rather three main types of components, since there are usually multiple workers in the system):

  • The web app (sources can be found in scotty-pastebin), powered by WAI, Scotty and Blaze;
  • The service app (eval-api/src-service);
  • Workers (eval-api/src).

The web server handles user requests and database logic, and renders the results. Workers are the processes that perform the actual evaluation. The service component is the one that manages the pool of workers: it keeps track of how many workers are available and forks new workers if necessary. The web app does not communicate with workers without the permission of the service.

All the communication between the components is performed with the help of UNIX sockets.

Request example

Here’s an example workflow in the system:

  1. A user connects to the web server and sends a request to evaluate a certain bit of code.
  2. The web server talks to the service, requesting a worker (see the sketch after this list).
  3. The service reuses an existing worker if an idle one is available. Otherwise it forks a new one, or blocks if the worker limit has been reached.
  4. The worker, upon starting, loads the necessary libraries and applies the security restrictions.
  5. The web server receives a worker and sends it a request for evaluation.
  6. The web server then waits; if there is no reply from the worker after a certain amount of time, it sends a message to the service saying that the worker timed out. If it does receive a reply, it stores the result in the database and continues with the user request.
  7. When the service receives a message about one of its workers, it decides whether to kill/restart it or not. If the worker’s process has timed out or resulted in an error (e.g. an out-of-memory exception), the service restarts it.
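
To illustrate step 2: since all communication goes over UNIX sockets, requesting a worker boils down to connecting to the service’s ‘control’ socket. A minimal sketch using the network package follows; the socket path and the message exchange are illustrative, not the actual interactive-diagrams protocol.

import Network.Socket

-- Connect to the service's control socket to ask for a worker
-- (sketch; the real code exchanges proper request/response types)
requestWorker :: FilePath -> IO Socket
requestWorker controlSock = do
    sock <- socket AF_UNIX Stream defaultProtocol
    connect sock (SockAddrUnix controlSock)
    -- ... send an 'acquire worker' request here, read back the
    -- worker's socket path, then connect to the worker ...
    return sock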

Component permissions

Setting up the right permissions for the components is a crucial part of creating a secure environment. Depending on which security restrictions you have enabled, you might want to choose different permissions for the processes. On http://paste.hskll.org we use the full set of security restrictions and limits (see the next section), which requires us to give the components certain permissions.

  • scotty-pastebin runs as a regular user in a multithreaded runtime;
  • eval-service runs as a superuser (required for setting up chroot jails) in a single-threaded environment (required due to forking/SELinux restrictions, see the SELinux section for details) and listens on the ‘control’ socket;
  • workers are forked from eval-service as root, but they change their process uid as soon as possible; each worker listens on its ‘workerN’ socket (opened prior to chroot’ing).

Additionally the whole thing runs in a VM.

See also this wiki page written by Luite.

Security limitations and restrictions

Interactive-diagrams applies a whole lot of limitations to the worker processes, which can be configured using the following datatype:

data LimitSettings = LimitSettings
    { -- | Maximum time for which the code is allowed to run
      -- (in seconds)
      timeout     :: Int
      -- | Process priority for the 'nice' syscall.
      -- -20 is the highest, 20 is the lowest
    , niceness    :: Int
      -- | Resource limits for the 'setrlimit' syscall
    , rlimits     :: Maybe RLimits
      -- | The directory that the evaluator process will be 'chroot'ed
      -- into. Please note that if chroot is applied, all the paths
      -- in 'EvalSettings' will be calculated relative to this
      -- value.
    , chrootPath  :: Maybe FilePath
      -- | The UID that will be set after the call to chroot.
    , processUid  :: Maybe UserID
      -- | SELinux security context under which the worker 
      -- process will be running.
    , secontext   :: Maybe SecurityContext
      -- | A filepath to the 'tasks' file for the desired cgroup.
      -- 
      -- For example, if I have mounted the @cpu@ controller at
      -- @/cgroups/cpu/@ and I want the evaluator to be running in the
      -- cgroup 'idiaworkers' then the 'cgroupPath' would be
      -- @/cgroups/cpu/idiaworkers@
    , cgroupPath  :: Maybe FilePath
    } deriving (Eq, Show, Generic)

There is also a Default instance for LimitSettings and RLimits with most of the restrictions turned off:

defaultLimits :: LimitSettings
defaultLimits = LimitSettings
    { timeout    = 3
    , niceness   = 10
    , rlimits    = Nothing
    , chrootPath = Nothing
    , processUid = Nothing
    , secontext  = Nothing 
    , cgroupPath = Nothing
    }

Below I’ll briefly describe each limitation/restriction with some details.

Timeout & niceness

The timeout field specifies how much time (in seconds) the server waits for the worker. (Note: this is the only limitation that is controlled on the side of the web server. The corresponding procedure is processTimeout; we really want this one to run in a multithreaded environment.)

Niceness is merely the value passed to the nice() syscall, nothing special.
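
A sketch of how these two could be implemented with standard libraries (System.Timeout on the server side, nice from System.Posix.Process on the worker side); this is an illustration, not the exact interactive-diagrams code:

import System.Posix.Process (nice)
import System.Timeout (timeout)

-- Server side: wait for a worker result for at most 'secs' seconds.
-- 'timeout' takes microseconds and returns Nothing when time is up.
waitForWorker :: Int -> IO a -> IO (Maybe a)
waitForWorker secs = timeout (secs * 1000000)

-- Worker side: lower the process priority via the nice() syscall
applyNiceness :: Int -> IO ()
applyNiceness = nice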

rlimits

The resource limits are controlled by calls to the setrlimit syscall. The limits themselves are defined in the RLimits datatype:

data RLimits = RLimits
    { coreFileSizeLimit :: ResourceLimits
    , cpuTimeLimit      :: ResourceLimits
    , dataSizeLimit     :: ResourceLimits
    , fileSizeLimit     :: ResourceLimits
    , openFilesLimit    :: ResourceLimits
    , stackSizeLimit    :: ResourceLimits
    , totalMemoryLimit  :: ResourceLimits
    } deriving (Eq, Show, Generic)

ResourceLimits itself is defined in System.Posix.Resource. For more information on resource limits see setrlimit(2).
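
Applying these limits then boils down to one setResourceLimit call per resource. A minimal sketch (the mapping between the RLimits fields and the Resource values is the obvious one; this is not the exact library code):

{-# LANGUAGE RecordWildCards #-}
import System.Posix.Resource

-- Apply every limit from the record via the setrlimit syscall
applyRLimits :: RLimits -> IO ()
applyRLimits RLimits{..} = mapM_ (uncurry setResourceLimit)
    [ (ResourceCoreFileSize, coreFileSizeLimit)
    , (ResourceCPUTime,      cpuTimeLimit)
    , (ResourceDataSize,     dataSizeLimit)
    , (ResourceFileSize,     fileSizeLimit)
    , (ResourceOpenFiles,    openFilesLimit)
    , (ResourceStackSize,    stackSizeLimit)
    , (ResourceTotalMemory,  totalMemoryLimit)
    ]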

Chrooted jail

In order to restrict the worker process we run it inside a chroot jail. The easiest way to create a fully working jail is to use debootstrap. It’s also necessary to install gcc and the GHC libraries inside the jail.

mkdir
sudo debootstrap wheezy /idia/run/workers/worker1
sudo chmod  /idia/run/workers/worker1
cd /idia/run/workers/worker1
sudo mkdir -p ./home/
sudo chown  ./home/
cd ./home/
mkdir .ghc && sudo mount --bind ~/.ghc .ghc
mkdir .cabal && sudo mount --bind ~/.cabal .cabal
mkdir ghc && sudo mount --bind ~/ghc ghc # ghc libs
cd ../..
cp ~/interactive-diagrams/common/Helper.hs .
sudo chroot .
apt-get install gcc # inside the chroot

I tried installing Emdebian using multistrap to reduce the size of the jail, but GHC won’t run properly in that environment, complaining about librt.so (which was present in the system), so I decided to stick with debootstrap. If anyone knows how to avoid this problem with multistrap please mail me or leave a comment.

Process uid

This is the uid the worker process will run under. The socket file will also be created by the user with this uid.
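
Dropping privileges is a one-liner with System.Posix.User; a sketch, assuming the processUid field from LimitSettings above:

import System.Posix.Types (UserID)
import System.Posix.User (setUserID)

-- Change the process uid right after chroot'ing (no-op if unset)
dropUid :: Maybe UserID -> IO ()
dropUid = maybe (return ()) setUserID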

SELinux

SELinux (Security-Enhanced Linux) is a Linux kernel module providing mechanisms for enforcing fine-grained mandatory access control, brought to you by the creators of the infamous PRISM!

SELinux allows the system administrator to control the security of the system by specifying AVC (access vector cache) rules in (modular) policy files. The SELinux kernel module sits there and monitors all the syscalls, and if it finds something that is not explicitly allowed in the policy, it blocks it. (Well, actually something a little bit different is going on, but for the sake of simplicity I am leaving that out.)

Everything on your system – files, network sockets, file handles, processes, directories – is labelled with an SELinux security context, which consists of a role, a user name (not related to the regular system user name) and a domain (also called a type in some literature). In the policy file you specify which domains are allowed to perform various actions on other domains. A typical piece of a policy file looks like this:

allow myprocess_t self:udp_socket { create connect };
allow myprocess_t bin_t:file { execute };

The first line states that the process from the domain myprocess_t is allowed to create and connect to the UDP sockets of the same domain. The second line allows a process in that domain to execute files of type bin_t (usually files in /bin/ and /usr/bin).

Note: the secontext field actually contains only the security domain. When the worker process changes its security context, it keeps the same user and role it originally had.

In our SELinux policy we have several domains:

  • idia_web_t – the domain under which scotty-pastebin runs
  • idia_web_exec_t – the domain of the scotty-pastebin executable and other files associated with that binary
  • idia_service_t – the domain under which eval-service runs
  • idia_service_exec_t – the domain of the eval-service executable and other files associated with that binary
  • idia_service_sock_t – UNIX socket files used for communication
  • idia_db_t, idia_pkg_t, idia_web_common_t – database files, packages, html files, templates and other stuff
  • idia_worker_env_t – chroot’ed environment in which the worker operates
  • idia_restricted_t – the most restricted domain in which the workers run and evaluate code

The reason we made the service program run in a single-threaded environment is the following: if we ran it in a multi-threaded environment (like we wanted to), the worker processes would demand access to file descriptors inherited from idia_service_t, which, of course, is dangerous and should not be allowed.

I personally don’t enjoy using SELinux very much. It’s very hard to configure, and among its shortcomings I can list the fact that there is no distinction between file types and process types; and there is no proper separation, even when using the modular policy, as duplicated types are checked when you load the module and there is no way (that I know of) to easily introduce a fresh unused type. And there is this thing that puzzled me for quite a while: home directories are treated specially. Even if you configure a subdirectory of your home dir to have a specific security context, restorecon won’t correctly install the context specified in the policy. You actually have to set the context yourself, using chcon.

Cgroups

Cgroups (control groups) is a Linux kernel facility that can be used to control how Linux schedules CPU time/shares and distributes memory among processes. It does so by organizing processes into hierarchical groups with configured behaviour.

Installing cgroups on Debian is somewhat tricky, because the package is a little bit weird:

sudo apt-get install cgroup-bin libcgroup1
sudo cgconfigparser -l ~/interactive-diagrams/cgconfig.conf

For our purposes we have a cgroup called idiaworkers. We also mount the cpu controller on /cgroups/cpu:

$> ls -l /cgroups/cpu/
total 0
-rw-r--r--. 1 root root 0 Jul 12 16:22 cgroup.clone_children
--w--w--w-. 1 root root 0 Jul 12 16:22 cgroup.event_control
-rw-r--r--. 1 root root 0 Jul 12 16:22 cgroup.procs
-rw-r--r--. 1 root root 0 Jul 12 16:22 cpu.shares
drwxr-xr-x. 2 root root 0 Jul 12 16:22 idiaworkers
-rw-r--r--. 1 root root 0 Jul 12 16:22 notify_on_release
-rw-r--r--. 1 root root 0 Jul 12 16:22 release_agent
-rw-r--r--. 1 root root 0 Jul 12 16:22 tasks
$> ls -l /cgroups/cpu/idiaworkers
total 0
-rw-r--r--. 1 root    root 0 Jul 12 16:22 cgroup.clone_children
--w--w--w-. 1 root    root 0 Jul 12 16:22 cgroup.event_control
-rw-r--r--. 1 root    root 0 Jul 12 16:22 cgroup.procs
-rw-r--r--. 1 root    root 0 Jul 12 16:22 cpu.shares
-rw-r--r--. 1 root    root 0 Jul 12 16:22 notify_on_release
-rw-r--r--. 1 vagrant root 0 Jul 14 06:21 tasks

In order to modify how much CPU time our group gets, we write to the cpu.shares file: echo 100 | sudo tee /cgroups/cpu/idiaworkers/cpu.shares (note that a plain sudo echo 100 > … would not work, because the redirection would be performed by the unprivileged shell). If we want to add a task/process to the group, we simply append to the tasks file: echo $PID >> /cgroups/cpu/idiaworkers/tasks. The workers append themselves to the tasks file automatically (if the cgroup restrictions are enabled in LimitSettings).
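
In Haskell that last step is a simple append. A sketch of what a worker might do, assuming cgroupPath points at the ‘tasks’ file as documented in LimitSettings:

import System.Posix.Process (getProcessID)

-- Add the current process to the cgroup by appending its PID
-- to the cgroup's 'tasks' file
joinCgroup :: FilePath -> IO ()
joinCgroup tasksFile = do
    pid <- getProcessID
    appendFile tasksFile (show pid ++ "\n")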

Open problems/requests

I am still not sure how to write tests for this project. Do I write tests for my GHC API wrappers? Do I write tests for my worker pool? I should probably take a look at how similar projects handle this.

Outro

So, as you can see, we have something working here, and now that we have taken the initial steps it will be much easier for us to push changes and make them available for the public to use and comment on. There is still a long way to go. The code needs some serious cleanup (we switched the design model a couple of weeks ago, which seriously affected the internal structure), and the documentation needs to be written. And of course new features are waiting to be implemented :) We will be supporting multiple UIDs for workers and looking into using LXC to simplify the setup process too.

I would like to thank augur and luite for their editorial feedback.

Stay tuned for the next posts about configuring the program for evaluation settings and reusing the components from the library.

The Protocol Problem

In the interactive-diagrams project we have a bunch of components (processes) running independently, even under different UIDs. They nevertheless need to communicate with each other, and we’ve decided to go the full UNIX way (since most of our code already depends on POSIX compatibility) and use IPC over UNIX sockets for communication.

Originally, I implemented the following protocol for sending data
around:

When sending data:

  1. Encode the data using the functions from the cereal package.
  2. Take the ‘length’ of the resulting bytestring, send it over the
    socket.
  3. Send the encoded data on the next line.

Upon receiving data:

  1. Read the first line, deserialize it to an x :: Int
  2. Read the next x bytes, deserialize the data.

A programmer experienced in the area would probably have spotted
the error in this approach already, but it took me some time to
find the bug, once I realised that I was getting deserialization
errors from time to time.

The problem, of course, is that I was relying on reading lines from
the socket, not bytes. For example, the number 2623 on my 64-bit
system serializes to something with a newline character in it:

[image: GHCi session showing the serialized bytes of 2623]
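
This is easy to reproduce in GHCi: cereal encodes an Int as 8 big-endian bytes on a 64-bit system, and 2623 is 0x0A3F, so the encoded bytestring contains the byte 0x0A, which is exactly the newline character:

h> import Data.Serialize
h> encode (2623 :: Int)
"\NUL\NUL\NUL\NUL\NUL\NUL\n?"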

The solution to this problem is to read the first 8 (or 4) bytes
to get the length of the upcoming data. To make sure that the number
of bytes representing the length of the data is constant on all
platforms, I’ve switched to using Word32.
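
A sketch of the fixed protocol, using cereal together with Network.Socket.ByteString from the network package (the recvExact loop is needed because recv may return fewer bytes than requested):

import Control.Applicative ((<$>))
import qualified Data.ByteString as BS
import Data.Serialize
import Network.Socket (Socket)
import Network.Socket.ByteString (recv, sendAll)

-- Send a 4-byte big-endian length prefix followed by the payload
sendData :: Serialize a => Socket -> a -> IO ()
sendData sock x = do
    let payload = encode x
    sendAll sock $ runPut $ putWord32be (fromIntegral (BS.length payload))
    sendAll sock payload

-- Read exactly 4 bytes of length, then exactly that many bytes
recvData :: Serialize a => Socket -> IO (Either String a)
recvData sock = do
    header <- recvExact sock 4
    case runGet getWord32be header of
        Left err  -> return (Left err)
        Right len -> decode <$> recvExact sock (fromIntegral len)

-- recv may return less than requested, so loop until we have it all
recvExact :: Socket -> Int -> IO BS.ByteString
recvExact sock n
    | n <= 0    = return BS.empty
    | otherwise = do
        chunk <- recv sock n
        if BS.null chunk
          then ioError (userError "connection closed")
          else BS.append chunk <$> recvExact sock (n - BS.length chunk)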

This approach, too, is of course not immune to errors. If you are
sending and receiving a lot of data, consider using a streaming
library.

PS: We originally planned to release the first public alpha version yesterday, but it turned out that configuring SELinux, chroots and other security measures is taking more time than expected. I’ve also changed the general structure of the project; I am using a different design model than I had a week ago. The new changes will make the application more scalable, and the resulting library can (and most likely will) be reused for other stuff. Stay tuned!

Agile development and deployment in the cloud with Haskell and vado

In this post I would like to give you an update on vado – a piece of
software for running programs on vagrant VMs (or any other ssh
server, actually) – and on the ghcjs build system, two projects I’ve
briefly contributed to.

1 New build system

The old build system for ghcjs was a little bit messy. Basically, it was
just one Puppet configuration file that contained a hardcoded shell
script as a resource that is supposed to be written to the home
directory and executed. I decided to clean it up a notch and take more
of a Puppet approach to the whole thing.

You can find the new set of build scripts on GitHub:
https://github.com/ghcjs/ghcjs-build

And since the errors are now printed to the screen, it’s easy to see
which stage the build is going through, and if anything goes wrong
you get an error trace for the current stage.

The prebuilt version has also been updated by
Luite Stegeman.

2 Vado

2.1 Vado intro

Hamish Mackenzie and I have been working on vado – a quick way to
run commands on a remote ssh server. Just mount the directory you
want to run the command in using sshfs, and in that directory (or a
subdirectory of it) run vado like this:
vado ls -l

vado will run ‘mount’ to identify the user account, server name and
the remote directory to run the command in. It will then run ssh to
connect to the server and run the command.

You can also pass ssh options like this:

vado -t htop

This tells vado to pass -t to ssh (forces pseudo-tty allocation and
makes programs like vim and htop work nicely).

I will explain below how to set up vado for multiple remote
servers/sshfs mount points and how to develop Haskell projects on a
remote server/VM nicely using Emacs and ghc-mod.

2.2 .vadosettings

Vado is not tied to vagrant, but it can be used with it and is
faster than vagrant ssh. If the user and host detected from mount
are specified in the ~/.vadosettings file, then the specified key
and port will be used.

The contents of the ~/.vadosettings file are basically a Haskell
list of MountSettings data structures; we use the standard Read and
Show type classes for serialization.

The MountSettings data type is defined as follows:

-- | Mount point settings
data MountSettings = MountSettings {
    sshfsUser :: Text
  , sshfsHost :: Text
  , sshfsPort :: Int
  , idFile :: FilePath
  } deriving (Show, Read)


If the file is not present or is incorrectly formatted,
then the default settings for vagrant will be used:

  • User: vagrant
  • Host: 127.0.0.1
  • Port: 2222
  • Key file: ~/.vagrant.d/insecure_private_key
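
Loading the file is then just read with a fallback to the defaults above. A sketch (loadSettings and defaultSettings are hypothetical names, not part of vado’s actual API):

import Control.Exception (IOException, handle)
import Data.Char (isSpace)

-- Parse the settings file via the Read instance, falling back to
-- the default vagrant settings on a missing or malformed file.
-- 'defaultSettings' is assumed to encode the defaults listed above.
loadSettings :: FilePath -> IO [MountSettings]
loadSettings path = handle fallback $ do
    contents <- readFile path
    case reads contents of
        [(settings, rest)] | all isSpace rest -> return settings
        _                                     -> return [defaultSettings]
  where
    fallback :: IOException -> IO [MountSettings]
    fallback _ = return [defaultSettings]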

2.2.1 Example .vadosettings file

An example settings file might look like this:

[
  MountSettings {
    sshfsUser = "vagrant"
  , sshfsHost = "localhost"
  , sshfsPort = 2222
  , idFile = "/Users/dan/.vagrant.d/insecure_private_key"
  }, 
  MountSettings {
    sshfsUser = "admin"
  , sshfsHost = "server.local"
  , sshfsPort = 2233
  , idFile = "/Users/dan/keys/local_server_key"
  }
]

2.3 Vamount

Of course, using vado requires mounting the sshfs beforehand. But
it gets tedious typing out

sshfs vagrant@localhost:/home/vagrant ../vm/ -p2222
-reconnect,defer_permissions,negative_vncache,volname=ghcjs,IdentityFile=~/.vagrant.d/insecure_private_key

every time. A tool called vamount, which is bundled together with
vado, can be used for mounting remote filesystems based on the
~/.vadosettings file.

You can use it like this:

vamount [ssh options] remote_path [profile #]

The remote_path on the remote server specified in the
~/.vadosettings file under number [profile #] will be mounted in the
current directory using sshfs.

The profile number count starts from 1. If the [profile #] is absent
or is 0 then the default (vagrant) configuration will be used.

2.4 Vado and ghc-mod

ghc-mod is a backend command designed to enrich Haskell programming
in editors like Emacs and Vim, and it also features a front-end for
Emacs as a set of elisp scripts. It’s a really cool piece of
software, and if you have not tried it yet I highly recommend
investing the time to install and use it.

What we would like, however, is to edit files on the mounted
filesystem using Emacs on the host machine, but run ghc-mod inside the
VM. In order to do that we need to install ghc-mod both on our host
machine and on the VM.

While installing ghc-mod on a host machine running the latest
haskell-platform is pretty straightforward, it is harder to do so on
the VM running GHC 7.7, due to the fact that many libraries are not
ready for GHC 7.7 and base 4.7 yet. We have to resort to installing
most of the things from source.

# run this on the guest machine
mkdir ghcmod && cd ghcmod

# patching and installing convertible
cabal unpack convertible
cd convertible*
wget http://co-dan.github.io/patched/convertible.patch
patch -p1 Data/Convertible/Utils.hs convertible.patch
cabal install
cd ..

# installing ghc-syb-utils
git clone https://github.com/co-dan/ghc-syb.git
cd ghc-syb/utils/
cabal install
cd ../..

# finally getting and installing ghc-mod
git clone https://github.com/co-dan/ghc-mod.git
cd ghc-mod
cabal install


ghc-mod itself uses the GHC API extensively, so it’s no surprise we
have to change at least some code. Now that we have installed
ghc-mod on the guest VM, we need to set up our host’s Emacs
configuration to communicate properly with the VM. First of all,
put this in your Emacs config:

(setq load-path (cons "~/Library/Haskell/ghc-7.6.3/lib/ghc-mod-2.0.3/share" load-path))
(autoload 'ghc-init "ghc" nil t)
(add-hook 'haskell-mode-hook (lambda () (ghc-init)))
;; (setq ghc-module-command "ghc-mod")
(setq ghc-module-command "~/vado-ghc-mod.sh")

~/vado-ghc-mod.sh should contain the following:

#!/bin/bash
VADO=/Users/dan/Library/Haskell/bin/vado
LOCAL_PATH=/Users/dan/projects/ghcjs/mnt/
REMOTE_PATH=/home/vagrant/
$VADO -t ghc-mod ${@//$LOCAL_PATH/$REMOTE_PATH} | sed "s,$REMOTE_PATH,$LOCAL_PATH,g"

I know that it’s a hack, but it does work, and I guess that’s what
shell scripts are for ;)

Now go to ~/.bashrc on the guest machine and make sure that the
PATH variable is set correctly:

PATH=/home/vagrant/ghcjs/bin:/home/vagrant/.cabal/bin:/home/vagrant/ghc/bin:/home/vagrant/jsshell:/home/vagrant/node-v0.10.10-linux-x86/bin:$PATH

# PATH is set *before* this line:
[ -z "$PS1" ] && return

# <snip>

And that’s it, you should be done!

Before (ghc-mod running on the host machine):

[screenshot: ghcmod-before]

After (ghc-mod running inside the ghcjs-build VM):

[screenshot: ghcmod-after]

3 Conclusion and future work

We’ve seen how vado, a small but useful tool, can make our lives
easier if we want to develop Haskell projects on a remote server or
on a virtual machine. You can get vado from GitHub:
https://github.com/hamishmack/vado

Next week we are planning on releasing the first version of the
interactive-diagrams paste site (not going to be very interactive
yet, though) and writing up its security model.

Meanwhile, check out Luite’s post on using the Sodium FRP library
for creating functional reactive web interfaces. It’s astonishing
how easily you can just take an FRP library, compile it to
JavaScript and make nifty web apps with it.