What is the Tinkerpop equivalent of Neo4j match/join on property values

Question

As with anything, there are lots of ways to approach this. Since you have a small dataset, I didn't think it would be a problem to do lookups over g.V within the Gremlin pipeline. To simulate your problem, I created my own: using the toy graph, add a sameFirstLetter edge from any vertex that has a lang property to any vertex that has an age property and whose name starts with the first letter of the lang value. In this case, it should add two edges, from 5 to 4 and from 3 to 4.

gremlin> g = TinkerGraphFactory.createTinkerGraph()
==>tinkergraph[vertices:6 edges:6]
gremlin> g.V.has('lang').transform{v->[v,g.V.has('age').filter{it.name.startsWith(v.lang[0])}.toList()]}.sideEffect{edgeList->edgeList[1].each{it.each{edgeList[0].addEdge('sameFirstLetter',it)}}}
==>[v[3], [v[4]]]
==>[v[5], [v[4]]]
gremlin> g.E
==>e[1][5-sameFirstLetter->4]
==>e[10][4-created->5]
==>e[0][3-sameFirstLetter->4]
==>e[7][1-knows->2]
==>e[9][1-created->3]
==>e[8][1-knows->4]
==>e[11][4-created->3]
==>e[12][6-created->3]

There are two pieces to this code. The first constructs an adjacency list of matches and the second creates the edges. Here's the part that gets the adjacency list:

g.V.has('lang').transform{v->[v,g.V.has('age').filter{it.name.startsWith(v.lang[0])}.toList()]}

The above code grabs all vertices that have a lang property and converts each one to a two-item list: the first item is the lang vertex and the second is a list of the vertices that have an age property and whose name starts with the first letter of lang (in this case the letter j for both lang vertices). Note that this use of has is part of the latest version of Gremlin, the soon-to-be-released 2.4.0. Prior to 2.4.0 you could do .hasNot('lang',null) or something similar.

.sideEffect{edgeList->edgeList[1].each{it.each{edgeList[0].addEdge('sameFirstLetter',it)}}}

The above sideEffect processes the output of the first piece, the adjacency list:

==>[v[3], [v[4]]]
==>[v[5], [v[4]]]

This operation could be performed as a separate line of code (not all Gremlin needs to be written in a single line, as satisfying as that may be). You could simply store the adjacency list in a variable and then post-process it to create the edges. In any case, I chose to use sideEffect here, looping over the list of lists and creating edges as I go.
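To make the "store the adjacency list, then post-process it" variant concrete, here is a small sketch of the same logic in plain Python. It is not Gremlin and does not touch TinkerPop. It assumes a hand-written copy of the toy graph's vertex properties, and the names `vertices`, `adjacency` and `edges` are made up for the illustration.

```python
# Plain-data stand-in for the toy graph's vertices (id -> properties).
# This is an assumption for illustration, not a TinkerPop API.
vertices = {
    1: {"name": "marko", "age": 29},
    2: {"name": "vadas", "age": 27},
    3: {"name": "lop", "lang": "java"},
    4: {"name": "josh", "age": 32},
    5: {"name": "ripple", "lang": "java"},
    6: {"name": "peter", "age": 35},
}

# Step 1: for every vertex with a lang property, scan all vertices that have
# an age property and keep those whose name starts with the first letter of
# lang. The result is stored instead of being consumed inline.
adjacency = [
    (src, [dst for dst, p in vertices.items()
           if "age" in p and p["name"].startswith(props["lang"][0])])
    for src, props in vertices.items()
    if "lang" in props
]

# Step 2: post-process the stored adjacency list into edges.
edges = [(src, "sameFirstLetter", dst)
         for src, targets in adjacency
         for dst in targets]

print(adjacency)
print(edges)
```

Keeping the two steps apart makes it easy to inspect the matches before any edge is written, at the cost of holding the whole adjacency list in memory.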

Alternatively, you could build an in-memory index keyed on the property value and then use it as a lookup to build the adjacency list. That way you only make two passes through the vertex list:

gremlin> m=g.V.groupBy{it.name[0]}{it}.cap.next()
==>v=[v[2]]
==>r=[v[5]]
==>p=[v[6]]
==>l=[v[3]]
==>m=[v[1]]
==>j=[v[4]]
gremlin> g.V.has('lang').transform{[it,m[it.lang[0]]]}
==>[v[3], [v[4]]]
==>[v[5], [v[4]]]

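This gets you to the same adjacency list as in the previous example. Edge creation from the adjacency list is then performed as previously noted. Note that the index here covers every vertex, not only those with an age property; on the toy graph the result is the same, but to mirror the first example exactly you could build the index from g.V.has('age') instead.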
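The index-then-lookup idea can also be sketched in plain Python. Again, this is only a model of the logic on a hand-written copy of the toy graph's vertex properties, not Gremlin; `vertices`, `index` and `adjacency` are illustrative names. Unlike the groupBy above, which indexes every vertex, this sketch indexes only vertices with an age property so that it applies the same filter as the first example. That restriction is my choice, not part of the original query.

```python
from collections import defaultdict

# Plain-data stand-in for the toy graph's vertices (id -> properties).
# This is an assumption for illustration, not a TinkerPop API.
vertices = {
    1: {"name": "marko", "age": 29},
    2: {"name": "vadas", "age": 27},
    3: {"name": "lop", "lang": "java"},
    4: {"name": "josh", "age": 32},
    5: {"name": "ripple", "lang": "java"},
    6: {"name": "peter", "age": 35},
}

# Pass 1: index the vertices that have an age property by the first letter
# of their name.
index = defaultdict(list)
for vid, props in vertices.items():
    if "age" in props:
        index[props["name"][0]].append(vid)

# Pass 2: for each vertex with a lang property, look up the first letter of
# lang in the index. .get() avoids creating empty entries for missing keys.
adjacency = [
    (vid, index.get(props["lang"][0], []))
    for vid, props in vertices.items()
    if "lang" in props
]

print(dict(index))
print(adjacency)
```

Each lookup is a dictionary access rather than a scan, so the work grows with the number of vertices instead of with the number of pairs.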