Friday, July 29, 2016

How to Build a Multitenant Application: A Hibernate Tutorial

When we talk about cloud applications where each client has their own separate data, we need to think about how to store and manipulate this data. Even with all the great NoSQL solutions out there, sometimes we still need to use the good old relational database. The first solution that might come to mind to separate data is to add an identifier in every table, so it can be handled individually. That works, but what if a client asks for their database? It would be very cumbersome to retrieve all those records hidden among the others.
Multitenancy in Java is easier than ever with Hibernate.
The Hibernate team came up with a solution to this problem a while ago. They provide some extension points that enable one to control from where data should be retrieved. This solution has the option to control the data via an identifier column, multiple databases, and multiple schemas. This article will cover the multiple schemas solution.
So, let’s get to work!

Getting Started

If you are a more experienced Java developer and know how to configure everything, or if you already have your own Java EE project, you can skip this section.
First, we have to create a new Java project. I am using Eclipse and Gradle, but you can use your preferred IDE and build tools, such as IntelliJ and Maven.
If you want to use the same tools as me, you can follow these steps to create your project:
  • Install Gradle plugin on Eclipse
  • Click on File -> New -> Other…
  • Find Gradle (STS) and click Next
  • Enter a name and choose Java Quickstart as the sample project
  • Click Finish
Great! This should be the initial file structure:
javaee-mt
|- src/main/java
|- src/main/resources
|- src/test/java
|- src/test/resources
|- JRE System Library
|- Gradle Dependencies
|- build
|- src
|- build.gradle
You can delete all files that come inside the source folders, as they are just sample files.
To run the project, I use Wildfly, and I will show how to configure it (again you can use your favorite tool here):
  • Download Wildfly: http://wildfly.org/downloads/ (I am using version 10)
  • Unzip the file
  • Install the JBoss Tools plugin on Eclipse
  • On the Servers tab, right-click any blank area and choose New -> Server
  • Choose Wildfly 10.x (9.x also works if 10 is not available, depending on your Eclipse version)
  • Click Next, choose Create New Runtime (next page) and click Next again
  • Choose the folder where you unzipped Wildfly as Home Directory
  • Click Finish
Now, let’s configure Wildfly so that it knows about the database:
  • Go to the bin folder inside your Wildfly folder
  • Execute add-user.bat or add-user.sh (depending on your OS)
  • Follow the steps to create your user as Manager
  • In Eclipse, go to the Servers tab again, right-click on the server you created and select Start
  • In your browser, access http://localhost:9990, which is the Management Interface
  • Enter the credentials of the user you just created
  • Deploy the driver jar of your database:
    1. Go to the Deployment tab and click Add
    2. Click Next, choose your driver jar file
    3. Click Next and Finish
  • Go to the Configuration tab
  • Choose Subsystems -> Datasources -> Non-XA
  • Click Add, select your database and click Next
  • Give a name to your data source and click Next
  • Select the Detect Driver tab and choose the driver you just deployed
  • Enter your database information and click Next
  • Click Test Connection if you want to make sure the information of the prior step is correct
  • Click Finish
  • Go back to Eclipse and stop the running server
  • Right-click on it, select Add and Remove
  • Add your project to the right
  • Click Finish
Alright, we have Eclipse and Wildfly configured together!
These are all the configurations required outside of the project. Let’s move on to the project configuration.

Bootstrapping Project

Now that we have Eclipse and Wildfly configured and our project created, we need to configure our project.
The first thing we are going to do is to edit build.gradle. This is how it should look:
apply plugin: 'java'
apply plugin: 'war'
apply plugin: 'eclipse'
apply plugin: 'eclipse-wtp'

sourceCompatibility = '1.8'
compileJava.options.encoding = 'UTF-8'

compileTestJava.options.encoding = 'UTF-8'

repositories {
    jcenter()
}

eclipse {
    wtp {
    }
}

dependencies {
    providedCompile 'org.hibernate:hibernate-entitymanager:5.0.7.Final'
    providedCompile 'org.jboss.resteasy:resteasy-jaxrs:3.0.14.Final'
    providedCompile 'javax:javaee-api:7.0'
}
The dependencies are all declared as “providedCompile” because this configuration makes them available at compile time without packaging them into the final war file. Wildfly already provides these libraries, and bundling them as well would cause conflicts.
At this point, you can right-click your project, select Gradle (STS) -> Refresh All to import the dependencies we just declared.
Time to create and configure the “persistence.xml” file, the file that contains the information that Hibernate needs:
  • In the src/main/resources source folder, create a folder called META-INF
  • Inside this folder, create a file named persistence.xml
The content of the file must be something like the following. Change jta-data-source to match the datasource you created in Wildfly, and change the package com.toptal.andrehil.mt.hibernate to the one you are going to create in the next section (unless you choose the same package name):
<?xml version="1.0" encoding="UTF-8" ?>
<persistence xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://java.sun.com/xml/ns/persistence http://java.sun.com/xml/ns/persistence/persistence_2_0.xsd"
    version="2.0" xmlns="http://java.sun.com/xml/ns/persistence">
    <persistence-unit name="pu">
        <jta-data-source>java:/JavaEEMTDS</jta-data-source>
        <properties>
            <property name="hibernate.multiTenancy" value="SCHEMA"/>
            <property name="hibernate.tenant_identifier_resolver" value="com.toptal.andrehil.mt.hibernate.SchemaResolver"/>
            <property name="hibernate.multi_tenant_connection_provider" value="com.toptal.andrehil.mt.hibernate.MultiTenantProvider"/>
        </properties>
    </persistence-unit>
</persistence>

Hibernate Classes

The configurations added to persistence.xml point to two custom classes: MultiTenantProvider and SchemaResolver. The first class is responsible for providing connections configured with the right schema. The second class is responsible for resolving the name of the schema to be used.
Here is the implementation of the two classes:
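import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

import javax.naming.Context;
import javax.naming.InitialContext;
import javax.naming.NamingException;
import javax.sql.DataSource;

import org.hibernate.HibernateException;
import org.hibernate.engine.jdbc.connections.spi.MultiTenantConnectionProvider;
import org.hibernate.service.spi.ServiceRegistryAwareService;
import org.hibernate.service.spi.ServiceRegistryImplementor;

public class MultiTenantProvider implements MultiTenantConnectionProvider, ServiceRegistryAwareService {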

    private static final long serialVersionUID = 1L;
    private DataSource dataSource;

    @Override
    public boolean supportsAggressiveRelease() {
        return false;
    }
    @Override
    public void injectServices(ServiceRegistryImplementor serviceRegistry) {
        try {
            final Context init = new InitialContext();
            dataSource = (DataSource) init.lookup("java:/JavaEEMTDS"); // Change to your datasource name
        } catch (final NamingException e) {
            throw new RuntimeException(e);
        }
    }
    @SuppressWarnings("rawtypes")
    @Override
    public boolean isUnwrappableAs(Class clazz) {
        return false;
    }
    @Override
    public <T> T unwrap(Class<T> clazz) {
        return null;
    }
    @Override
    public Connection getAnyConnection() throws SQLException {
        final Connection connection = dataSource.getConnection();
        return connection;
    }
    @Override
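    public Connection getConnection(String tenantIdentifier) throws SQLException {
        final Connection connection = getAnyConnection();
        // try-with-resources closes the statement once the schema has been switched
        try (Statement statement = connection.createStatement()) {
            statement.execute("SET SCHEMA '" + tenantIdentifier + "'");
        } catch (final SQLException e) {
            // Do not leak the connection if the schema switch fails
            connection.close();
            throw new HibernateException("Error trying to alter schema [" + tenantIdentifier + "]", e);
        }
        return connection;
    }
    @Override
    public void releaseAnyConnection(Connection connection) throws SQLException {
        // Reset the schema before the connection goes back to the pool
        try (Statement statement = connection.createStatement()) {
            statement.execute("SET SCHEMA 'public'");
        } catch (final SQLException e) {
            throw new HibernateException("Error trying to alter schema [public]", e);
        } finally {
            // Always return the connection, even if the reset failed
            connection.close();
        }
    }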
    @Override
    public void releaseConnection(String tenantIdentifier, Connection connection) throws SQLException {
        releaseAnyConnection(connection);
    }
}
The syntax used in the statements above works with PostgreSQL and some other databases. If your database uses a different syntax to change the current schema, adjust the statements accordingly.
import org.hibernate.context.spi.CurrentTenantIdentifierResolver;

public class SchemaResolver implements CurrentTenantIdentifierResolver {

    private String tenantIdentifier = "public";

    @Override
    public String resolveCurrentTenantIdentifier() {
        return tenantIdentifier;
    }
    @Override
    public boolean validateExistingCurrentSessions() {
        return false;
    }
    public void setTenantIdentifier(String tenantIdentifier) {
        this.tenantIdentifier = tenantIdentifier;
    }
}
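One detail of MultiTenantProvider is worth guarding: getConnection builds its statement by concatenating the tenant identifier into SQL, and that identifier will usually come from outside the application. The following is a minimal sketch of a validation helper, not part of the original project. The class name TenantIdentifiers and the naming rule (a lowercase letter followed by up to 62 lowercase letters, digits, or underscores) are assumptions made for illustration; adjust the pattern to whatever your schema names actually look like.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original project.
final class TenantIdentifiers {

    // Assumption: schema names start with a lowercase letter and contain only
    // lowercase letters, digits and underscores (at most 63 characters).
    private static final Pattern VALID = Pattern.compile("[a-z][a-z0-9_]{0,62}");

    private TenantIdentifiers() {
    }

    // Returns the identifier unchanged if it is safe to embed in SQL,
    // otherwise throws IllegalArgumentException.
    static String requireValid(String tenantIdentifier) {
        if (tenantIdentifier == null || !VALID.matcher(tenantIdentifier).matches()) {
            throw new IllegalArgumentException("Invalid tenant identifier: " + tenantIdentifier);
        }
        return tenantIdentifier;
    }

    // Builds the PostgreSQL-style statement used by MultiTenantProvider.
    static String setSchemaSql(String tenantIdentifier) {
        return "SET SCHEMA '" + requireValid(tenantIdentifier) + "'";
    }
}
```

With a helper like this, getConnection could execute TenantIdentifiers.setSchemaSql(tenantIdentifier) instead of concatenating the raw value. Where the JDBC driver supports it, Connection.setSchema(String) (available since JDBC 4.1) is another option that avoids building SQL by hand.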
At this point, it is already possible to test the application. For now, our resolver always returns the hard-coded public schema, but it is already being called. To check this, stop your server if it is running and start it again. You can run it in debug mode and place a breakpoint anywhere in the classes above to confirm that they are invoked.

Practical Use Of The Resolver

So, how could the resolver actually contain the right name of the schema?
One way to achieve this is to keep an identifier in the header of all requests and then create a filter to inject the name of the schema.
Let’s implement a filter class to exemplify the usage. The resolver can be accessed through Hibernate’s SessionFactory, so we will take advantage of that to get it and inject the right schema name.
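import java.io.IOException;

import javax.persistence.EntityManagerFactory;
import javax.persistence.PersistenceUnit;
import javax.ws.rs.container.ContainerRequestContext;
import javax.ws.rs.container.ContainerRequestFilter;
import javax.ws.rs.ext.Provider;

import org.hibernate.engine.spi.SessionFactoryImplementor;
import org.hibernate.jpa.internal.EntityManagerFactoryImpl;

@Provider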
public class AuthRequestFilter implements ContainerRequestFilter {

    @PersistenceUnit(unitName = "pu")
    private EntityManagerFactory entityManagerFactory;

    @Override
    public void filter(ContainerRequestContext containerRequestContext) throws IOException {
        final SessionFactoryImplementor sessionFactory = ((EntityManagerFactoryImpl) entityManagerFactory).getSessionFactory();
        final SchemaResolver schemaResolver = (SchemaResolver) sessionFactory.getCurrentTenantIdentifierResolver();

        final String username = containerRequestContext.getHeaderString("username");
        schemaResolver.setTenantIdentifier(username);
    }
}
Now, when any class gets an EntityManager to access the database, it will already be configured with the right schema.
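One caveat: the filter fetches the resolver from the SessionFactory, so there is a single SchemaResolver instance and its tenantIdentifier field is shared by every request thread. Under concurrent load, one request could overwrite the schema chosen by another. A common way around this is to keep the identifier in a ThreadLocal. The sketch below is not part of the original project: the TenantContext class is hypothetical, and it assumes each request is handled by a single thread from the filter through to the database access.

```java
// Hypothetical helper, not part of the original project.
final class TenantContext {

    static final String DEFAULT_TENANT = "public";

    // One value per thread, so concurrent requests cannot overwrite each other.
    private static final ThreadLocal<String> CURRENT =
            ThreadLocal.withInitial(() -> DEFAULT_TENANT);

    private TenantContext() {
    }

    // Called at the start of a request, for example from the request filter.
    static void set(String tenantIdentifier) {
        CURRENT.set(tenantIdentifier);
    }

    // Called by the resolver to find the schema for the current thread.
    static String get() {
        return CURRENT.get();
    }

    // Call when the request ends, because application servers reuse
    // pooled threads and a stale value would leak into the next request.
    static void clear() {
        CURRENT.remove();
    }
}
```

With this in place, resolveCurrentTenantIdentifier() in SchemaResolver could return TenantContext.get(), the request filter could call TenantContext.set(username), and a matching response filter could call TenantContext.clear().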
For the sake of simplicity, the filter shown above reads the identifier directly from a string in the header, but it is a good idea to use an authentication token and store the identifier in the token. If you are interested in knowing more about this subject, I suggest taking a look at JSON Web Tokens (JWT). JWT is an open standard (RFC 7519) for compact, signed tokens, and several Java libraries implement it.
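To make the token idea concrete, here is a rough sketch of reading a tenant from an HS256-signed token using only the JDK. It is an illustration under several assumptions, not a full JWT implementation: the TenantToken class and the "tenant" claim name are hypothetical, the claim is extracted with a simple pattern instead of a JSON parser, and the code does not check the "alg" header or the expiry. A real application should rely on a maintained JWT library.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical helper: reads a "tenant" claim from an HS256-signed token.
final class TenantToken {

    // Assumption: the token payload carries a claim such as {"tenant":"joe"}.
    private static final Pattern TENANT_CLAIM =
            Pattern.compile("\"tenant\"\\s*:\\s*\"([A-Za-z0-9_]+)\"");

    private TenantToken() {
    }

    // Base64url(HMAC-SHA256(secret, "<header>.<payload>")), as used by HS256.
    static String sign(String signingInput, byte[] secret) throws GeneralSecurityException {
        final Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(secret, "HmacSHA256"));
        final byte[] signature = mac.doFinal(signingInput.getBytes(StandardCharsets.US_ASCII));
        return Base64.getUrlEncoder().withoutPadding().encodeToString(signature);
    }

    // Verifies the signature, then returns the tenant claim from the payload.
    static String tenantFrom(String token, byte[] secret) throws GeneralSecurityException {
        final String[] parts = token.split("\\.");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Malformed token");
        }
        final String expected = sign(parts[0] + "." + parts[1], secret);
        // Constant-time comparison of the expected and received signatures
        if (!MessageDigest.isEqual(expected.getBytes(StandardCharsets.US_ASCII),
                parts[2].getBytes(StandardCharsets.US_ASCII))) {
            throw new SecurityException("Invalid token signature");
        }
        final String payload =
                new String(Base64.getUrlDecoder().decode(parts[1]), StandardCharsets.UTF_8);
        final Matcher matcher = TENANT_CLAIM.matcher(payload);
        if (!matcher.find()) {
            throw new IllegalArgumentException("Token has no tenant claim");
        }
        return matcher.group(1);
    }
}
```

The request filter could then read the token from a header, call TenantToken.tenantFrom(token, secret), and hand the result to the resolver instead of trusting a plain username header.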

How to Use All of This

With everything configured, there is nothing else needed to do in your entities and/or classes that interact with EntityManager. Anything you run from an EntityManager will be directed to the schema resolved by the created filter.
Now, all you need to do is to intercept requests on the client side and inject the identifier/token in the header to be sent to the server side.
In a real application, you will have a better means of authentication. The general idea of multitenancy, however, will remain the same.
The link at the end of the article points to the project used to write this article. It uses Flyway to create 2 schemas and contains an entity class called Car and a REST service class called CarService that can be used to test the project. You can follow all the steps above, but instead of creating your own project, you can clone this one and use it. Then, with the application running, use a simple HTTP client (such as the Postman extension for Chrome) to make a GET request to http://localhost:8080/javaee-mt/rest/cars with one of the following key:value headers:
  • username:joe; or
  • username:fred.
By doing this, the requests will return different values, which are stored in different schemas, one called “joe” and the other called “fred”.
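If you prefer to issue the request from code, the sketch below prepares the same GET request with the JDK's HttpURLConnection. The TenantRequests class and the prepareGet method are hypothetical names used only for illustration; the URL and the username header are the ones described above.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical client-side helper, not part of the original project.
final class TenantRequests {

    private TenantRequests() {
    }

    // Prepares a GET request carrying the tenant in the "username" header.
    // Nothing is sent until the caller reads the response.
    static HttpURLConnection prepareGet(String url, String username) throws IOException {
        final HttpURLConnection connection = (HttpURLConnection) new URL(url).openConnection();
        connection.setRequestMethod("GET");
        connection.setRequestProperty("username", username);
        return connection;
    }
}
```

Calling getResponseCode() or getInputStream() on the returned connection sends the request, so TenantRequests.prepareGet("http://localhost:8080/javaee-mt/rest/cars", "joe") followed by reading the response should return the cars stored in the joe schema.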

Final Words

This is not the only way to build multitenant applications in the Java world, but it is a simple one.
One thing to keep in mind is that Hibernate doesn’t generate DDL when using multitenancy configuration. My suggestion is to take a look at Flyway or Liquibase, which are great libraries to control database creation. This is a nice thing to do even if you are not going to use multitenancy, as the Hibernate team advises to not use their auto database generation in production.
The source code used to create this article and environment configuration can be found at github.com/andrehil/JavaEEMT

Friday, July 22, 2016

Hunting Down Memory Issues In Ruby: A Definitive Guide

I’m sure there are some lucky Ruby developers out there who will never run into issues with memory, but for the rest of us, it’s incredibly challenging to hunt down where memory usage is getting out of hand and fix it. Fortunately, if you’re using a modern Ruby (2.1+), there are some great tools and techniques available for dealing with common issues. It could also be said that memory optimization can be fun and rewarding although I may be alone in that sentiment.
If you thought bugs were pesky, wait until you hunt for memory issues.
As with all forms of optimization, odds are that it will add code complexity, so it’s not worth doing unless there are measurable and significant gains.
Everything described here is done using the canonical MRI Ruby, version 2.2.4, although other 2.1+ versions should behave similarly.

It’s Not a Memory Leak!

When a memory issue is discovered, it’s easy to jump to the conclusion that there’s a memory leak. For example, in a web application, you may see that after you spin up your server, repeated calls to the same endpoint keep driving memory usage up higher with each request. There are certainly cases where legitimate memory leaks happen, but I’d wager they are vastly outnumbered by memory issues with this same appearance that aren’t actually leaks.
As a (contrived) example, let’s look at a bit of Ruby code that repeatedly builds a big array of hashes and discards it. First, here’s some code that’ll be shared throughout the examples in this post:
# common.rb
require "active_record"
require "active_support/all"
require "get_process_mem"
require "sqlite3"
ActiveRecord::Base.establish_connection(
  adapter: "sqlite3",
  database: "people.sqlite3"
)
class Person < ActiveRecord::Base; end
def print_usage(description)
  mb = GetProcessMem.new.mb
  puts "#{ description } - MEMORY USAGE(MB): #{ mb.round }"
end
def print_usage_before_and_after
  print_usage("Before")
  yield
  print_usage("After")
end
def random_name
  (0...20).map { (97 + rand(26)).chr }.join
end
And the array builder:
# build_arrays.rb
require_relative "./common"
ARRAY_SIZE = 1_000_000
times = ARGV.first.to_i
print_usage(0)
(1..times).each do |n|
  foo = []
  ARRAY_SIZE.times { foo << {some: "stuff"} }
  print_usage(n)
end
$ ruby build_arrays.rb 10
0 - MEMORY USAGE(MB): 17
1 - MEMORY USAGE(MB): 330
2 - MEMORY USAGE(MB): 481
3 - MEMORY USAGE(MB): 492
4 - MEMORY USAGE(MB): 559
5 - MEMORY USAGE(MB): 584
6 - MEMORY USAGE(MB): 588
7 - MEMORY USAGE(MB): 591
8 - MEMORY USAGE(MB): 603
9 - MEMORY USAGE(MB): 613
10 - MEMORY USAGE(MB): 621
The get_process_mem gem is just a convenient way to get the memory being used by the current Ruby process. What we see is the same behavior that was described above, a continual increase in memory usage.
However, if we run more iterations, we’ll eventually plateau.
$ ruby build_arrays.rb 40
0 - MEMORY USAGE(MB): 9
1 - MEMORY USAGE(MB): 323
...
32 - MEMORY USAGE(MB): 700
33 - MEMORY USAGE(MB): 699
34 - MEMORY USAGE(MB): 698
35 - MEMORY USAGE(MB): 698
36 - MEMORY USAGE(MB): 696
37 - MEMORY USAGE(MB): 696
38 - MEMORY USAGE(MB): 696
39 - MEMORY USAGE(MB): 701
40 - MEMORY USAGE(MB): 697
Hitting this plateau is the hallmark of not being an actual memory leak, or that the memory leak is so small that it’s not visible compared to the rest of the memory usage. What may not be intuitive is why memory usage continues to grow after the first iteration. After all, it built a big array, but then promptly discarded it and started building a new one of the same size. Can’t it just use the space freed up by the previous array? The answer, which explains our problem, is no. Aside from tuning the garbage collector, you don’t have control over when it runs, and what we’re seeing in the build_arrays.rb example is new memory allocations being made prior to garbage collection of our old, discarded objects.
I should point out that this isn’t some sort of horrible memory management issue specific to Ruby, but is generally applicable to garbage-collected languages. Just to reassure myself of this, I reproduced essentially the same example with Go and saw similar results. However, there are Ruby libraries that make it easy to create this sort of memory issue.
Do not panic if you see a sudden rise in the memory usage of your app. Apps can run out of memory for all sorts of reasons - not just memory leaks.

Divide and Conquer

So if we need to work with large chunks of data, are we doomed to just throw lots of RAM at our problem? Thankfully, that’s not the case. If we take the build_arrays.rb example and decrease the array size, we’ll see a decrease in the point where memory usage plateaus that’s roughly proportional to the array size.
This means that if we can break our work into smaller pieces to process and avoid having too many objects existing at one time, we can dramatically reduce the memory footprint. Unfortunately, that often means taking nice, clean code and turning it into more code that does the same thing, just in a more memory-efficient way.

Isolating Memory Usage Hotspots

In a real codebase, the source of a memory issue will likely not be as obvious as in the build_arrays.rb example. Isolating a memory issue before trying to actually dig in and fix it is essential because it’s easy to make incorrect assumptions about what’s causing the problem.
I generally use two approaches, often in combination, to track down memory issues: leaving the code intact and wrapping a profiler around it, and monitoring memory usage of the process while disabling/enabling different parts of the code I suspect could be problematic. I’ll be using memory_profiler here for profiling, but ruby-prof is another popular option, and derailed_benchmarks has some great Rails-specific capabilities.
Here’s some code that’ll use a bunch of memory, where it may not be immediately clear which step is pushing up memory usage the most:
# people.rb
require_relative "./common"
def run(number)
  Person.delete_all
  names = number.times.map { random_name }
  names.each do |name|
    Person.create(name: name)
  end
  records = Person.all.to_a
  File.open("people.txt", "w") { |out| out << records.to_json }
end
Using get_process_mem, we can quickly verify that it does use a lot of memory when there are a lot of Person records being created.
# before_and_after.rb
require_relative "./people"
print_usage_before_and_after do
  run(ARGV.shift.to_i)
end
Result:
$ ruby before_and_after.rb 10000
Before - MEMORY USAGE(MB): 37
After - MEMORY USAGE(MB): 96
Looking through the code, there are multiple steps that seem like good candidates for using a lot of memory: building a big array of strings, calling #to_a on an Active Record relation to make a big array of Active Record objects (not a great idea, but done for demonstration purposes), and serializing the array of Active Record objects.
We can then profile this code to see where memory allocations are happening:
# profile.rb
require "memory_profiler"
require_relative "./people"
report = MemoryProfiler.report do
  run(1000)
end
report.pretty_print(to_file: "profile.txt")
Note that the number being fed to run here is 1/10 of the previous example, since the profiler itself uses a lot of memory, and can actually lead to memory exhaustion when profiling code that already causes high memory usage.
The results file is rather lengthy and includes memory and object allocation and retention at the gem, file, and location levels. There’s a wealth of information to explore, but here are a couple of interesting snippets:
allocated memory by gem
-----------------------------------
17520444 activerecord-4.2.6
7305511 activesupport-4.2.6
2551797 activemodel-4.2.6
2171660 arel-6.0.3
2002249 sqlite3-1.3.11
...
allocated memory by file
-----------------------------------
2840000 /Users/bruz/.rvm/gems/ruby-2.2.4/gems/activesupport-4.2.6/lib/active_support/hash_with_indifferent_access.rb
2006169 /Users/bruz/.rvm/gems/ruby-2.2.4/gems/activerecord-4.2.6/lib/active_record/type/time_value.rb
2001914 /Users/bruz/code/mem_test/people.rb
1655493 /Users/bruz/.rvm/gems/ruby-2.2.4/gems/activerecord-4.2.6/lib/active_record/connection_adapters/sqlite3_adapter.rb
1628392 /Users/bruz/.rvm/gems/ruby-2.2.4/gems/activesupport-4.2.6/lib/active_support/json/encoding.rb
We see the most allocations happening inside Active Record, which would seem to point at either instantiating all the objects in the records array, or serialization with #to_json. Next, we can test our memory usage without the profiler while disabling these suspects. We can’t disable retrieving records and still be able to do the serialization step, so let’s try disabling serialization first.
# File.open("people.txt", "w") { |out| out << records.to_json }
Result:
$ ruby before_and_after.rb 10000
Before: 36 MB
After: 47 MB
That does indeed seem to be where most of the memory is going, with before/after memory delta dropping 81% by skipping it. We can also see what happens if we stop forcing the big array of records to be created.
# records = Person.all.to_a
records = Person.all
# File.open("people.txt", "w") { |out| out << records.to_json }
Result:
$ ruby before_and_after.rb 10000
Before: 36 MB
After: 40 MB
This reduces memory usage as well, although it’s an order of magnitude less reduction than disabling serialization. So at this point, we know our biggest culprits, and can make a decision about what to optimize based on this data.
Although the example here was contrived, the approaches are generally applicable. Profiler results may not point you at the exact spot in your code where the problem lies, and can also be misinterpreted, so it’s a good idea to follow up by looking at actual memory usage while turning sections of code on and off. Next, we’ll look at some common cases where memory usage becomes an issue and how to optimize them.

Deserialization

A common source of memory issues is deserializing large amounts of data from XML, JSON or some other data serialization format. Using methods like JSON.parse or Active Support’s Hash.from_xml is incredibly convenient, but when the data you’re loading is large, the resulting data structure that’s loaded in memory can be enormous.
If you have control over the source of the data, you can do things to limit the amount of data you’re receiving, like adding filtering or pagination support. But if it’s an external source or one you can’t control, another option is to use a streaming deserializer. For XML, Ox is one option, and for JSON yajl-ruby appears to operate similarly, although I don’t have much experience with it.
Just because you have limited memory doesn't mean you cannot parse large XML or JSON documents safely. Streaming deserializers allow you to incrementally extract whatever you need from these documents and still keep the memory footprint low.
Here’s an example of parsing a 1.7MB XML file, using Hash#from_xml.
# parse_with_from_xml.rb
require_relative "./common"
print_usage_before_and_after do
  # From http://www.cs.washington.edu/research/xmldatasets/data/mondial/mondial-3.0.xml
  file = File.open(File.expand_path("../mondial-3.0.xml", __FILE__))
  hash = Hash.from_xml(file)["mondial"]["continent"]
  puts hash.map { |c| c["name"] }.join(", ")
end
$ ruby parse_with_from_xml.rb
Before - MEMORY USAGE(MB): 37
Europe, Asia, America, Australia/Oceania, Africa
After - MEMORY USAGE(MB): 164
111MB for a 1.7MB file! This clearly is not going to scale up well. Here’s the streaming parser version.
# parse_with_ox.rb
require_relative "./common"
require "ox"
class Handler < ::Ox::Sax
  def initialize(&block)
    @yield_to = block
  end
  def start_element(name)
    case name
    when :continent
      @in_continent = true
    end
  end
  def end_element(name)
    case name
    when :continent
      @yield_to.call(@name) if @name
      @in_continent = false
      @name = nil
    end
  end
  def attr(name, value)
    case name
    when :name
      @name = value if @in_continent
    end
  end
end
print_usage_before_and_after do
  # From http://www.cs.washington.edu/research/xmldatasets/data/mondial/mondial-3.0.xml
  file = File.open(File.expand_path("../mondial-3.0.xml", __FILE__))
  continents = []
  handler = Handler.new do |continent|
    continents << continent
  end
  Ox.sax_parse(handler, file)
  puts continents.join(", ")
end
$ ruby parse_with_ox.rb
Before - MEMORY USAGE(MB): 37
Europe, Asia, America, Australia/Oceania, Africa
After - MEMORY USAGE(MB): 37
This brings us down to a negligible memory increase and should be able to handle vastly larger files. However, the tradeoff is that we now have 28 lines of handler code we didn’t need before, which seems like it’d be error prone, and for production use it should have some tests around it.

Serialization

As we saw in the section about isolating memory usage hotspots, serialization can have high memory costs. Here’s the key part of people.rb from earlier.
# to_json.rb
require_relative "./common"
print_usage_before_and_after do
  File.open("people.txt", "w") { |out| out << Person.all.to_json }
end
Running this with 100,000 records in the database, we get:
$ ruby to_json.rb
Before: 36 MB
After: 505 MB
The issue with calling #to_json here is that it instantiates an object for every record, and then encodes to JSON. Generating the JSON record-by-record so that only one record object would need to exist at a time reduces the memory usage significantly. None of the popular Ruby JSON libraries appear to handle this, but a commonly recommended approach is to build the JSON string manually. There is a json-write-stream gem that provides a nice API for doing this, and converting our example to this looks like:
# json_stream.rb
require_relative "./common"
require "json-write-stream"
print_usage_before_and_after do
  file = File.open("people.txt", "w")
  JsonWriteStream.from_stream(file) do |writer|
    writer.write_object do |obj_writer|
      obj_writer.write_array("people") do |arr_writer|
        Person.find_each do |people|
          arr_writer.write_element people.as_json
        end
      end
    end
  end
end
Once again, we see optimization has given us more code, but the result seems worth it:
$ ruby json_stream.rb
Before: 36 MB
After: 56 MB

Being Lazy

A great feature added to Ruby starting with 2.0 is the ability to make enumerators lazy. This is great for improving memory usage when chaining methods on an enumerator. Let’s start with some code that isn’t lazy:
# not_lazy.rb
require_relative "./common"
number = ARGV.shift.to_i
print_usage_before_and_after do
  names = number.times
    .map { random_name }
    .map { |name| name.capitalize }
    .map { |name| "#{ name } Jr." }
    .select { |name| name[0] == "X" }
    .to_a
end
Result:
$ ruby not_lazy.rb 1_000_000
Before: 36 MB
After: 546 MB
What happens here is that at each step in the chain, it iterates over every element in the enumerator, producing an array that has the subsequent method in the chain invoked on it, and so forth. Let’s see what happens when we make this lazy, which just requires adding a call to lazy on the enumerator we get from times:
# lazy.rb
require_relative "./common"
number = ARGV.shift.to_i
print_usage_before_and_after do
  names = number.times.lazy
    .map { random_name }
    .map { |name| name.capitalize }
    .map { |name| "#{ name } Jr." }
    .select { |name| name[0] == "X" }
    .to_a
end
Result:
$ ruby lazy.rb 1_000_000
Before: 36 MB
After: 52 MB
Finally, an example that gives us a huge memory usage win, without adding a lot of extra code! Note that if we didn’t need to accumulate any results at the end, for instance, if each item was saved to the database and could then be forgotten, there would be even less memory usage. To make a lazy enumerable evaluate at the end of the chain, just add a final call to force.
Another thing to note about the example is that the chain starts with a call to times prior to lazy, which uses very little memory since it just returns an enumerator that will generate an integer each time it’s invoked. So if an enumerable can be used instead of a big array at the beginning of the chain, that will help.
One real-world application of building an enumerable to lazily feed into some sort of processing pipeline is processing paginated data. So rather than requesting all the pages and putting them into one big array, they could be exposed through an enumerator that nicely hides all the pagination details. This could look something like:
def records
  Enumerator.new do |yielder|
    has_more = true
    page = 1
    while has_more
      response = fetch(page)
      response.records.each { |record| yielder << record }
      page += 1
      has_more = response.has_more
    end
  end
end

Conclusion

We’ve done some characterization of memory usage in Ruby, and looked at some general tools for tracking down memory issues, as well as some common cases and ways to improve them. The common cases we explored are by no means comprehensive and are highly biased by the sort of issues I personally have encountered. However, the biggest gain may just be getting in the mindset of thinking about how the code will impact memory usage.
This post originally appeared on the Toptal Engineering blog