R in production

https://datascience.stackexchange.com/questions/5244

16-10-2019

Question

Many of us are very familiar with using R for reproducible but narrowly targeted ad-hoc analysis. Given that R is currently the best collection of cutting-edge scientific methods from world-class experts in each field, and given that plenty of libraries exist for data I/O in R, it seems very natural to extend its use into production environments for live decision making.

Therefore my questions are:

Has any of you gone into production with pure R? (I know of Shiny, yhat, etc., but it would be very interesting to hear about pure R.)
Is there a good book, guide, or article on building R into serious live decision-making pipelines (e.g. credit scoring)?
I would also like to hear if you think it's not a good idea at all.

Solution

Speed of code execution is rarely an issue. The important speed in business is almost always the speed of designing, deploying, and maintaining the application. An experienced programmer can optimize where necessary to get code execution fast enough. In these cases, R can make a lot of sense in production.

In cases where speed of execution IS an issue, you will usually already find an optimized real-time decision engine in place, written in C++ or something similar. So your choices are to integrate an R process or to add the bits you need to the engine. The latter is probably the only option, not because of the speed of R, but because you don't have the time budget to call out to any external process. If the company has nothing to start with, I can't imagine everyone saying "let's build our time-critical real-time engine in R because of the great statistical libraries".

I'll give a few examples from my corporate experience where I use R in production:

Delivering Shiny applications that deal with data that is not, or not yet, institutionalized. I generally load already-processed data frames and use Shiny to display different graphs and charts. Computation is minimal.
Decision-making analysis that requires heavy use of advanced libraries (mcclust, machine learning) but runs on a daily or longer time scale. In this case there is no reason to use any other language. I've already done the prototyping in R, so my fastest and best option is to keep things there.

I did not use R for production when integrating with a real-time C++ decision engine. Issues:

An additional layer of complication to spawn R processes and integrate the results
A suitable machine-learning library (Waffles) was available in C++

The caveat in the latter case: I still use R to generate the training files.
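To make the "additional layer" concrete, here is a minimal sketch in Python of what a host service has to do to hand a request to an external scoring process and read the result back. It is an illustration under stated assumptions, not how the engine described above worked: the function name `score_with_external_process`, the JSON-over-stdin/stdout protocol, and the stand-in command are all hypothetical. In a real setup the command might be something like `["Rscript", "score.R"]`, where `score.R` is a hypothetical script that reads JSON from stdin and writes JSON to stdout.

```python
import json
import subprocess
import sys


def score_with_external_process(command, payload, timeout_s=5.0):
    """Send `payload` as JSON on stdin to `command`; parse JSON from its stdout.

    Every call pays for process start-up, serialization, and parsing,
    which is the overhead a real-time engine usually cannot afford.
    """
    completed = subprocess.run(
        command,
        input=json.dumps(payload),
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=False,
    )
    if completed.returncode != 0:
        raise RuntimeError(
            f"scorer exited with code {completed.returncode}: "
            f"{completed.stderr.strip()}"
        )
    return json.loads(completed.stdout)


# Stand-in scorer so the sketch runs without R installed (assumption):
# it just reports how many features it received.
STAND_IN_COMMAND = [
    sys.executable,
    "-c",
    "import json, sys; d = json.load(sys.stdin); "
    "print(json.dumps({'n': len(d['features'])}))",
]

if __name__ == "__main__":
    print(score_with_external_process(STAND_IN_COMMAND, {"features": [0.2, 1.5, 3.0]}))
```

Even in this small form, the host has to deal with timeouts, non-zero exit codes, and malformed output, on top of the per-call start-up cost.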

Other tips

R and most of its CRAN packages are licensed under the GPL.

In many companies, legal departments go crazy if you propose to use anything that is GPL in production... It's not reasonable, but you'll see they love Apache, and hate GPL. Before going into production, make sure it's okay with the legal department. (IMHO you are safe to use your modified code for internal products. Integrating R into your commercial product and handing this out to others is very different. But unfortunately, many legal departments try to ban all use of GPL whatsoever.)

Other than that, R is often really slooow unless it is calling Fortran code hidden inside. It's nice while you are still trying to figure out what to do. But for production, you may want maximum performance and full integration with your services. Benchmark for yourself whether R is the best choice for your use case.
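As a rough illustration of what "benchmark for yourself" can look like, here is a minimal timing harness sketched in Python. The names `measure_latency` and `meets_budget`, the repeat count, and the latency budget are assumptions made for the example. `func` is any zero-argument callable, for example a wrapper around your R scoring step or around the alternative you are comparing it against.

```python
import statistics
import time


def measure_latency(func, repeats=20):
    """Call `func()` `repeats` times; return each call's wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return timings


def meets_budget(timings, budget_s):
    """Return True if the median latency is within `budget_s` seconds."""
    return statistics.median(timings) <= budget_s


if __name__ == "__main__":
    # Placeholder workload (assumption); swap in the real call under test.
    timings = measure_latency(lambda: sum(range(100_000)))
    print(f"median: {statistics.median(timings):.6f} s, max: {max(timings):.6f} s")
    print("within 50 ms budget:", meets_budget(timings, budget_s=0.050))
```

The median is used because single runs are noisy. For production decisions you would also want to look at the worst-case timings and run on hardware and data that resemble the real service.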

On the performance issues with R (I know R advocates are going to downvote me for saying so ...):

Morandat, F., Hill, B., Osvald, L., & Vitek, J. (2012). Evaluating the design of the R language. In ECOOP 2012–Object-Oriented Programming (pp. 104-131). Springer Berlin Heidelberg.

(by the TraceR/ProfileR/ReactoR people from Purdue, who are now working on fastR, which tries to execute R code on the JVM?) states:

On those benchmarks, R is on average 501 [times] slower than C and 43 times slower [than] Python.

and:

Observations. R is clearly slow and memory inefficient. Much more so than other dynamic languages. This is largely due to the combination of language features (call-by-value, extreme dynamism, lazy evaluation) and the lack of efficient built-in types. We believe that with some effort it should be possible to improve both time and space usage, but this would likely require a full rewrite of the implementation.

Sorry to break the news. It's not my research, but it aligns with my observations.

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange