Extensible distributed graph sampling library for Spark GraphFrames.
This library is a pure Python package. It expects to run inside a Spark environment where pyspark is available and the GraphFrames package is attached to the Spark session.
pip install graph-distributed-sampling
GraphFrames is provided as a Spark package. Start Spark with GraphFrames attached, e.g.:
# Adjust versions to your Spark/Scala setup
pyspark \
  --packages graphframes:graphframes:0.8.3-spark3.5-s_2.12
Or in code when creating a SparkSession:
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .appName("sampling")
    .config("spark.jars.packages", "graphframes:graphframes:0.8.3-spark3.5-s_2.12")
    .getOrCreate()
)
from graphframes import GraphFrame
from distributed_sampling import sample_graph
# gf: a GraphFrame whose vertices have an id column and whose edges have src and dst columns
sampled = sample_graph(gf, "random_node", fraction=0.1, seed=42)
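The quick start assumes that gf already exists. As a minimal sketch, a small GraphFrame could be built as follows, assuming an active SparkSession named spark with GraphFrames attached. The toy rows and the helper name build_example_graph are illustrative and not part of this library.

```python
# Toy data (hypothetical): vertex rows are (id, name), edge rows are (src, dst).
VERTICES = [("a", "Alice"), ("b", "Bob"), ("c", "Carol"), ("d", "Dave")]
EDGES = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]


def build_example_graph(spark):
    """Build a small GraphFrame from the toy rows above.

    `spark` is assumed to be a SparkSession started with the GraphFrames
    package attached.
    """
    # Imported lazily so this module can be imported without GraphFrames.
    from graphframes import GraphFrame

    # GraphFrames requires an "id" column on vertices and
    # "src"/"dst" columns on edges.
    vertices = spark.createDataFrame(VERTICES, ["id", "name"])
    edges = spark.createDataFrame(EDGES, ["src", "dst"])
    return GraphFrame(vertices, edges)
```

With this helper, gf = build_example_graph(spark) yields a graph that can be passed to sample_graph as in the quick start.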
- sample_graph(graph: GraphFrame, method: str, **kwargs) -> GraphFrame: the top-level API.
- Samplers implement the GraphSampler interface, which has a sample method.
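To illustrate how a name-based top-level function can hand off to sampler classes, here is a minimal single-machine sketch. It is not this library's implementation: ToySampler, register, toy_sample_graph, and the first_k_edges method are hypothetical stand-ins, and the graph is a plain edge list rather than a GraphFrame.

```python
from abc import ABC, abstractmethod


class ToySampler(ABC):
    """Hypothetical stand-in for a sampler interface with a sample method."""

    @abstractmethod
    def sample(self, graph):
        """Return a sampled graph."""


# Hypothetical registry mapping method names to sampler classes.
_REGISTRY = {}


def register(name):
    """Class decorator that records a sampler class under `name`."""
    def decorator(cls):
        _REGISTRY[name] = cls
        return cls
    return decorator


@register("first_k_edges")
class FirstKEdges(ToySampler):
    """Toy sampler: keep the first k edges of an edge list."""

    def __init__(self, k=2):
        self.k = k

    def sample(self, graph):
        return graph[: self.k]


def toy_sample_graph(graph, method, **kwargs):
    """Look up a sampler by name, build it from kwargs, and run it."""
    try:
        sampler_cls = _REGISTRY[method]
    except KeyError:
        raise ValueError(f"unknown sampling method: {method!r}") from None
    return sampler_cls(**kwargs).sample(graph)
```

For example, toy_sample_graph([(1, 2), (2, 3), (3, 4)], "first_k_edges", k=2) keeps the first two edges. Keyword arguments are forwarded to the sampler's constructor, which mirrors how fraction and seed are passed in the quick start.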
Available sampling methods:
- random_node: random node sampling with induced subgraph.
- random_edge: randomly sample edges and keep their endpoints.
- snowball: BFS-style expansion from seeds up to a given depth.
- random_walk: multiple random walks; induced subgraph of visited nodes.
- forest_fire: burn a fraction of neighbors iteratively.
- topk: keep top-K vertices by degree (in/out/total).
- bfs(seeds=None, num_seeds=1, depth=2): breadth-first expansion.
- uvsn(fraction=0.1|num_vertices, seed=None): uniform vertex sampling with neighborhood.
- nuvsn(fraction=0.1|num_vertices, seed=None): non-uniform (degree-weighted) vertex sampling with neighborhood.
- mhrw(walk_length=100, num_walks=10, start_nodes=None, seed=None): Metropolis-Hastings random walk.
- rwe(walk_length=100, num_walks=10, start_nodes=None, alpha=0.15, seed=None): random walk with escaping/restarts.
- mirw(walk_length=10, num_walks=100, seed=None): multiple independent walkers.
- mdrw(walk_length=10, num_walks=10, attr=None, seed=None): multi-dimensional walk biased by vertex attribute.
- dfs(seeds=None, num_seeds=1, max_steps=100): approximate depth-first sampling.
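To make two of these methods concrete, the following single-machine sketch shows the general technique behind random node sampling with an induced subgraph and behind a Metropolis-Hastings random walk. It uses plain Python containers instead of GraphFrames and is not the library's distributed implementation. The function names, the per-vertex coin flip, and the undirected adjacency-list input are assumptions made for illustration.

```python
import random


def random_node_sample(edges, fraction, seed=None):
    """Keep each vertex with probability `fraction`, then keep only the
    edges whose two endpoints both survived (the induced subgraph)."""
    rng = random.Random(seed)
    # Sort so that the result is reproducible for a given seed.
    vertices = sorted({v for edge in edges for v in edge})
    kept = {v for v in vertices if rng.random() < fraction}
    induced = [(s, d) for s, d in edges if s in kept and d in kept]
    return kept, induced


def mhrw(adjacency, start, walk_length, seed=None):
    """Metropolis-Hastings random walk on an undirected graph.

    From node u, propose a uniformly chosen neighbor v and accept the move
    with probability min(1, deg(u) / deg(v)); otherwise stay at u. This
    correction removes the degree bias of a simple random walk.
    """
    rng = random.Random(seed)
    current = start
    visited = [current]
    for _ in range(walk_length):
        neighbors = adjacency[current]
        if not neighbors:
            break  # isolated node: the walk cannot continue
        proposal = rng.choice(neighbors)
        accept = min(1.0, len(neighbors) / len(adjacency[proposal]))
        if rng.random() < accept:
            current = proposal
        visited.append(current)
    return visited
```

With adjacency = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}, calling mhrw(adjacency, start=0, walk_length=50, seed=1) returns a list of 51 visited nodes. Rejected proposals appear as repeated entries, which is what keeps high-degree nodes from being over-represented.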
- Implement a class that subclasses GraphSampler:
from distributed_sampling.samplers import GraphSampler


class MySampler(GraphSampler):
    name = "my_method"  # optional convenience

    def __init__(self, my_param: float = 0.5):
        self.my_param = my_param

    def sample(self, graph):
        # return a GraphFrame
        ...
- Expose it via entry points in your own package's pyproject.toml:
[project.entry-points."distributed_sampling.samplers"]
my_method = "my_pkg.my_module:MySampler"
Your sampler will then be discoverable by the name "my_method" at runtime, without any code changes in this library.
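For readers curious how entry-point discovery works in general, here is a minimal sketch using the standard library's importlib.metadata. It only illustrates the mechanism and is not a copy of this library's loader. The helper names discover_samplers and load_sampler are hypothetical; the group name is the one from the pyproject.toml snippet above.

```python
from importlib.metadata import entry_points

GROUP = "distributed_sampling.samplers"


def discover_samplers(group=GROUP):
    """Return a dict of entry-point name -> EntryPoint for `group`."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: entry points can be filtered with select().
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9: entry_points() returns a plain mapping by group.
        selected = eps.get(group, [])
    return {ep.name: ep for ep in selected}


def load_sampler(name, group=GROUP):
    """Import and return the object registered under `name`."""
    found = discover_samplers(group)
    if name not in found:
        raise LookupError(f"no sampler registered as {name!r} in {group!r}")
    # EntryPoint.load() imports "my_pkg.my_module" and returns MySampler.
    return found[name].load()
```

Once a package declaring the entry point is installed in the same environment, load_sampler("my_method") would return the MySampler class. Nothing is imported until load() is called, so discovery stays cheap even when many plugins are installed.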
python -m venv .venv
source .venv/bin/activate
pip install -e .[dev]
pytest
https://arxiv.org/pdf/1308.5865
- python -m build
- twine upload dist/*