Advocating for your Analytic Infrastructure - VirtualBox and Vagrant
tl;dr
Do cool stuff with R, RStudio products and VirtualBox!
Last weekend I attended SatRdays Chicago to give a talk on advocating for great analytic infrastructure as a data scientist. Specifically, I wanted to share some ideas for how someone might start developing skills and exploring this space.
There are a number of barriers to entry around analytic administration, and it’s something I’m incredibly passionate about helping people navigate. Nathan Stephens regularly writes and speaks about R Administration. I’ve adopted his framework for talking about the process of bringing R into organizations and the role of an R Admin through that process.
Nathan has identified three buckets or stages as part of this journey: Legitimacy, Competencies and Adoption. The focus of my talk last weekend was to share some thoughts around the second stage - how to go about developing competencies.
Find a Place to Learn#
Once you’ve committed to developing competencies and managing R tooling, you’ll need to decide where to do that learning. My first recommendation is to focus on doing your learning in environments that closely mirror the analytic infrastructure that currently exists inside your organization.
If your organization has invested in a particular cloud platform, see if you can get an account there to start learning those systems. If you’re trying to decide whether to use Docker, do some research - does your IT department currently support containerized applications? If you don’t feel like you’re ready to ask those questions or make those decisions, there are also some good options for developing competencies which don’t require you to ask for accounts, access or advice from anyone.
The important thing is to find a learning environment. If mirroring your existing infrastructure is a barrier in any way, don’t waste cycles on trying to go down that path! The rest of this article is all about how to get learning environments on your local desktop machine (Mac, Windows, Linux…) for zero dollars.
Learning Environments for Zero Dollars: Virtual Machines, VirtualBox#
Learn the lay of the land: VirtualBox + RStudio Team QuickStart VM#
A common barrier to getting started with RStudio professional products is that some of them (RStudio Connect, RStudio Package Manager) don’t have open source analogs. There are free 45-day evaluations for all the professional products, but it’s hard to jump straight into an on-premises installation evaluation before first getting your hands on the products to learn a bit about how they’re intended to work and integrate with each other.
The RStudio Team QuickStart VM is a virtual appliance file you can download here. Warning: It’s a large download, but it does enable you to explore the functionality of RStudio Workbench, RStudio Connect and RStudio Package Manager on your own desktop (no server required!).
RStudio Team QuickStart VM requires Oracle VM VirtualBox - a free and open source virtualization product that runs on Mac, Windows, and other operating systems. The instructions for getting set up with VirtualBox and QuickStart are all available here.
Once you’re up and running, navigate to localhost:5000 in an internet browser. There you should see the QuickStart landing page and instructions for logging in to each product. The landing page also offers a Product Tour of RStudio Connect and Resource Guide. I recommend browsing through each of these pages before you log in to the products themselves so that you’ll know what to expect.
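If the browser shows nothing, it can help to check from a terminal whether anything is answering on that port. This is a minimal sketch that assumes `curl` is installed on the host; the function name `url_is_up` is made up for illustration and is not part of the QuickStart.

```shell
# Succeed if an HTTP server answers at the given URL within 5 seconds.
# (url_is_up is a hypothetical helper name, not a QuickStart command.)
url_is_up() {
  curl --silent --output /dev/null --max-time 5 "$1"
}

if url_is_up "http://localhost:5000"; then
  echo "QuickStart landing page is reachable"
else
  echo "Nothing is answering on localhost:5000 yet"
fi
```

A VM that is still booting will fail this check for a minute or two, so it is worth retrying before assuming something is wrong.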
Start with a Sandbox: VirtualBox + Vagrant#
Once you know how all the tools work and which ones could be helpful to your team, it’s time to start learning how to build a sandbox of your own.
Developing competencies in R (and RStudio) management requires hands-on experience with installation and configuration. There are verbose admin guides available for all RStudio products, but it can be difficult to know where to begin. My advice here is to set a clear goal and start simple.
Last year I wrote a piece for the RViews blog on Learning Analytic Administration Through Sandboxes. It is a how-to guide for setting up a particular flavor of RStudio and Shiny server sandbox on AWS cloud resources. If you want to invest in learning cloud, that can be a great first project. But the real point of the piece wasn’t the how-to guide. I wrote it to convey one simple message: Start small.
Nathan Stephens gave an RStudio Webinar last month, Best Practices for Administering RStudio in Production, which laid out a really nice mental model for how we tend to describe levels of infrastructure complexity and maturity:
This diagram starts with a sandbox containing the same products as the QuickStart. I like to think of sandboxes as projects that exist on a single machine. I build all sorts of different sandboxes; some have only one RStudio product on them, some have several. You can always install and configure RStudio Pro products under a free 45-day evaluation license, which is especially useful for creating short-lived learning environments. Installers for the open source products are very similar, but aren’t subject to the 45-day evaluation period, and can be used for longer-living sandboxes.
If you already installed VirtualBox in order to try the RStudio Team QuickStart VM, a great first project would be to install one (or more) of the RStudio server products from scratch on your own VirtualBox machine.
Build your own (mini) RStudio Team QuickStart VM#
There are a few things that I consider hard requirements for a nice learning environment:
- A modern Linux operating system
- An internet connection
- Sudo access
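Once you have a machine to work in, those three requirements can be checked from its shell. The sketch below assumes a Linux guest with `/etc/os-release`, `curl` and `sudo` available; the function names are hypothetical, and CRAN is used as the connectivity target only because it is where R sources come from.

```shell
# Print the distribution name recorded in an os-release file.
os_pretty_name() {
  # Source the file in a subshell so its variables do not leak out.
  ( . "${1:-/etc/os-release}" && echo "$PRETTY_NAME" )
}

# Succeed if an HTTPS request to CRAN completes within 10 seconds.
has_internet() {
  curl --silent --head --max-time 10 https://cran.r-project.org >/dev/null
}

# Succeed if the current user can run sudo without a password prompt.
has_sudo() {
  sudo -n true 2>/dev/null
}

# Report on all three requirements.
check_sandbox() {
  echo "OS:       $(os_pretty_name)"
  if has_internet; then echo "Internet: ok"; else echo "Internet: unreachable"; fi
  if has_sudo; then echo "Sudo:     ok"; else echo "Sudo:     needs a password or is unavailable"; fi
}
```

Running `check_sandbox` inside the virtual machine prints one line per requirement.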
In addition to VirtualBox, I also highly recommend installing Vagrant. Vagrant is a command-line utility for building and managing virtual machine environments.
- Install the latest version of Vagrant
- Review the getting started guide: Get Started with VirtualBox and Vagrant
Many of the Vagrant examples you’ll see (in the getting started guide and elsewhere) use a hashicorp/precise64 box. This is actually a "standard Ubuntu 12.04" VM.
- Unfortunately, Ubuntu 12 is no longer a supported platform for RStudio products, so I don’t recommend using hashicorp/precise64 for your sandbox.
- Luckily, it is also really easy to find a new box with a more modern operating system: Discover Vagrant Boxes. (For this how-to guide, I picked ubuntu/xenial64.)
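Before moving on, it is worth confirming that both tools are installed and on your PATH. This is a small sketch rather than an official check; `require_tool` is a hypothetical helper name, while `VBoxManage` and `vagrant` are the command-line programs that ship with VirtualBox and Vagrant.

```shell
# Report whether a command-line tool is on the PATH, with its version if found.
# (require_tool is a hypothetical helper name.)
require_tool() {
  tool="$1"
  if command -v "$tool" >/dev/null 2>&1; then
    echo "found: $tool ($("$tool" --version 2>/dev/null | head -n 1))"
  else
    echo "missing: $tool" >&2
    return 1
  fi
}

# Check both tools without aborting on the first one that is missing.
for tool in VBoxManage vagrant; do
  require_tool "$tool" || true
done
```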
Set up a Sandbox with Vagrant and VirtualBox#
1. Create a new directory to work in#
2. Run the following command to add ubuntu/xenial64 to your stash of boxes:#
vagrant box add ubuntu/xenial64
3. Proceed to run the init and up commands as shown below.#
Running vagrant init ubuntu/xenial64 will create a Vagrantfile in your current directory. You don’t have to worry too much about the Vagrantfile now; it’s something that can be interesting to learn about later though.
kellys-MBP-2:vagrant-xenial kelly$ vagrant init ubuntu/xenial64
A `Vagrantfile` has been placed in this directory. You are now
ready to `vagrant up` your first virtual environment!
kellys-MBP-2:vagrant-xenial kelly$ vagrant up
Bringing machine 'default' up with 'virtualbox' provider...
==> default: Importing base box 'ubuntu/xenial64'...
==> default: Matching MAC address for NAT networking...
==> default: Checking if box 'ubuntu/xenial64' is up to date...
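When you do get curious about the Vagrantfile, a stripped-down one looks roughly like the sketch below. The box name comes from this guide. The forwarded port is my assumption: 8787 is the default port for RStudio Server, and forwarding it lets a browser on the host reach a server installed in the VM. The helper name `write_vagrantfile` is hypothetical.

```shell
# Write a minimal Vagrantfile into the given directory.
write_vagrantfile() {
  cat > "$1/Vagrantfile" <<'EOF'
Vagrant.configure("2") do |config|
  # The box chosen in this guide
  config.vm.box = "ubuntu/xenial64"

  # Assumption: expose RStudio Server's default port (8787) to the host browser
  config.vm.network "forwarded_port", guest: 8787, host: 8787
end
EOF
}

# Write to a scratch directory so an existing Vagrantfile is not overwritten.
scratch="$(mktemp -d)"
write_vagrantfile "$scratch"
cat "$scratch/Vagrantfile"
```

To try it, copy the file into your project directory and run `vagrant reload` so the running machine picks up the change.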
4. Finally, get access to your new machine with the ssh command#
kellys-MBP-2:vagrant-xenial kelly$ vagrant ssh
Welcome to Ubuntu 16.04.6 LTS (GNU/Linux 4.4.0-146-generic x86_64)
5. The world is your oyster!#
You're ready to practice. There are plenty of things to do from here, but may I suggest getting started by building R from source?
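For a sense of what building R from source involves, here is a rough sketch to run inside the sandbox. Several details are my assumptions and not steps from this article: the R version (3.6.3 is an arbitrary example), the `/opt/R/<version>` install prefix, the configure flags, and that `deb-src` entries are enabled in `/etc/apt/sources.list` so `apt-get build-dep` can resolve dependencies. Treat the walkthroughs linked below as the authoritative steps.

```shell
# R version to build; 3.6.3 is an arbitrary example, not a recommendation.
R_VERSION="${R_VERSION:-3.6.3}"

# CRAN groups source tarballs by major version, e.g. src/base/R-3/R-3.6.3.tar.gz
r_source_url() {
  version="$1"
  major="${version%%.*}"
  echo "https://cran.r-project.org/src/base/R-${major}/R-${version}.tar.gz"
}

# Download, compile and install R under /opt/R/<version> (assumed prefix).
build_r() {
  version="$1"
  # build-dep needs deb-src entries enabled in /etc/apt/sources.list
  sudo apt-get update &&
  sudo apt-get build-dep -y r-base &&
  curl -fsSLO "$(r_source_url "$version")" &&
  tar -xzf "R-${version}.tar.gz" &&
  cd "R-${version}" &&
  ./configure --prefix="/opt/R/${version}" --enable-R-shlib --with-blas --with-lapack &&
  make &&
  sudo make install
}

echo "Would download: $(r_source_url "$R_VERSION")"
# Inside the sandbox VM, start the build with: build_r "$R_VERSION"
```

Installing each version under its own prefix keeps several R versions side by side, which is handy for a sandbox where you expect to experiment.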
Recommended Resources:
- The data science lab resource, maintained by RStudio Solutions Engineering, has a great walkthrough for basic system setup and product installations.
- To get a general overview of installation, configuration and architecture diagrams, check out the RStudio in Production resource (currently a work in progress).
6. Keep a record#
My last recommendation for this article is to keep a record of the installation process. The things you do and learn now will be valuable to share with other people in your organization as well as your future self.
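One low-effort way to do that is to run installation commands through a small wrapper that logs each command and its output. This is only a sketch; the `log_step` name and the `install-notes.log` file name are hypothetical.

```shell
# File that collects the record; the name is an arbitrary choice.
LOG_FILE="${LOG_FILE:-install-notes.log}"

# Run a command, appending a timestamp, the command line and its output to the log.
# Note: the function's exit status is tee's, not the wrapped command's.
log_step() {
  printf '\n## %s\n$ %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >> "$LOG_FILE"
  "$@" 2>&1 | tee -a "$LOG_FILE"
}
```

For example, `log_step sudo apt-get update` runs the update as usual and leaves a dated entry in the log. The standard `script` utility is an alternative if you would rather record a whole terminal session.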
Remember that developing competencies for the management of analytic infrastructure is a journey. The skills you learn by building sandboxes today will translate into the expertise and understanding necessary for implementing more complex architectures later on.
Shutting down the Sandbox#
One of my biggest pet peeves is when how-to guides don’t tell you how to shut down what you started. You can exit and halt (stop) the machine like so:
vagrant@ubuntu-xenial:/$ exit
logout
Connection to 127.0.0.1 closed.
kellys-MBP-2:vagrant-xenial kelly$ vagrant halt
==> default: Attempting graceful shutdown of VM...
I can also see that the VM is shut down in the VirtualBox GUI.
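The same check works from the command line. The sketch below reads the machine state out of `vagrant status --machine-readable`, whose lines are comma-separated as timestamp, target, type, data. The `vagrant_state` helper name is hypothetical.

```shell
# Extract the machine state from `vagrant status --machine-readable` output.
# Each line is comma-separated: timestamp,target,type,data
vagrant_state() {
  awk -F, '$3 == "state" { print $4 }'
}

# Only query Vagrant if it is installed; run this from the project directory.
if command -v vagrant >/dev/null 2>&1; then
  vagrant status --machine-readable | vagrant_state
fi
```

`VBoxManage list runningvms` gives the VirtualBox view of the same thing. When you are finished with a sandbox for good, `vagrant destroy` removes the virtual machine and frees its disk space.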
Good luck with your adventures in sandbox-driven learning! As always, feel free to reach out to the RStudio Solutions Engineering team and R Admins community on the forum.