How to convert a SQLContext Dataframe to RDD of vectors in Python?
-
16-10-2019 - |
سؤال
I have a SQLContext data frame derived from pandas data frame consisting of several numerical columns. I want to perform multivariate statistical analysis using the pyspark.mllib.stats package. The statistics function expects a RDD of vectors. I could not convert this data frame into RDD of vectors. Is there a way to convert the data frame?
Code:
rdd = sqlCtx.createDataFrame(df_new)
summary = Statistics.colStats(rdd)
I am getting df_new from
df_new = df.applymap(lambda s: dic.get(s) if s in dic else s) #df is a pandas dataframe
I am getting a PY4JJava error at the summary line. The issue is with the format of rdd.
المحلول
The Dataframe Python API exposes the RDD of a Dataframe by calling the following :
df.rdd # you can save it, perform transformations of course, etc.
df.rdd returns the content as an pyspark.RDD of Row.
You can then map on that RDD of Row transforming every Row into a numpy
vector. I can't be more specific about the transformation since I don't know what your vector represents with the information given.
Note 1: df
is the variable define our Dataframe.
Note 2: this function is available since Spark 1.3