While working on discogs2pg, a tool for importing Discogs’ data dumps into Postgre efficiently, I ran into a space leak.
So, in the beginning, things worked nice and in constant memory.
I opened the XML as a
Text.XML.Expat.Tree took that and gave me a lazy tree of nodes, I made a list of records with those and finally, they were copied by
Database.PostgreSQL.Simple.Copy into de database.
The parsing and storing of the first XML file was done, so I move on to the next one, simplifying and refactoring as needed, all while thinking “oh my Haskell is great, so so good”.
Suddenly, while trying the thing, I realized something was eating memory. My first suspect was firefox, but
htop quickly verified it was the parser.
I’m using stack, which is great by the way, for the first time in a real project, and enabling profiling was super easy.
First I ran
stack build --executable-profiling --library-profiling and it cried about
base missing. This was because I didn’t have the profiling base library installed, something
apt fixed right away. Then I started trying different flags, since profiling in haskell is not really well documented, even though it’s really well supported.
and saw this, which may well be the epitomical example of a space leak:
So I googled ARR_WORDS and found it has to do with ByteStrings (I use a lot of those here because performance). The thing is, I have no idea what they have to do with the space leak because I fixed it with a little change that doesn’t involve them.
But first, let me introduce you to the program.
The program ends in the
store function which takes a list of things that implement Storable, escapes them (generates some
ByteString.Builder) and makes them
ByteString for Postgre’s COPY. The class looks like this:
Ignore the weird
[a] -> Stuff, that’s gonna change in the future. That just says “a collection of ‘a’ has this attributes”.
So, each Storable can become a
[Builder] (one for each table) and a list of them has a
[TableInfo] (same size as the
[Builder]), which is just a
String (the table name) and a
[String] (the table columns). This is used in
store like this:
store :: (Show a, Storable a) => ConnectionInfo -> [a] -> IO () store conf values = do -- open a connection per table, TRUNCATE the table -- and start the COPY ... FROM STDIN conns <- forM (getTables values) $ \table -> do conn <- connectPostgreSQL conf begin conn _ <- execute_ conn $ fromString ("TRUNCATE " <> tableName table) copy_ conn $ toQuery table return conn -- then for each value forM_ values $ \val -> -- we get the [Builder] let rows = toRows val -- zip them with the connections list forM_ (zip conns rows) $ \(c, r) -> -- make them ByteString and send it to Postgre mapM_ (putCopyData c) (toChunks $ toLazyByteString r) -- finally, putCopyEnd for each connection, printing the table name -- in the process and the count of copied elements forM_ (zip (getTables values) $ conns) $ \(table, conn) -> do n <- putCopyEnd conn commit conn putStrLn $ tableName table <> " = " <> show n
Pretty imperative looking, but it works like a charm. Except for the space leak.
Now, the first thing I tried was refactoring the inner loop, avoiding currying and composition but that didn’t work.
I googled a while but nothing.
I was getting ready for making all my records strict, which I knew was not the problem since it worked perfectly a day ago, and then fortunately while looking around, I got rid of the duplicate
(getTables values) which was just bothering me … and holy shit that was it!
The fix is so simple and misterious it’s not even funny:
Which generates this pretty graph:
Somehow a list of
String was causing a 60MB space leak of ARR_WORDS.
I love Haskell.
UPDATE: The explanation
Over at reddit pycube offers what seems to be a very reasonable explanation (link).
Basically, even though
getTables doesn’t depend on
values, GHC doesn’t really know this, all it can see is
So in the leaky version, it doesn’t GC the things inside
values because it thinks it’s gonna need them later for the last
getTables! Meanwhile in the fixed version, nothing prevents those things from being GCed.
That’s also why the profile says the leak is from ARR_WORDS, because what was leaking was
Thanks to pycube and the folks over at reddit for helping with this!