Sampling methods and estimation of triangle count distributions in large networks

Abstract This paper investigates the distributions of triangle counts per vertex and edge, as a means for network description, analysis, model building, and other tasks. The main interest is in estimating these distributions through sampling, especially for large networks. A novel sampling method tailored for the estimation analysis is proposed, with three sampling designs motivated by several network access scenarios. An estimation method based on inversion and an asymptotic method are developed to recover the entire distribution. A single method to estimate the distribution using multiple samples is also considered. Algorithms are presented to sample the network under the various access scenarios. Finally, the estimation methods on synthetic and real-world networks are evaluated in a data study.


Introduction
Triangles formed by edges are the most fundamental topological structures of networks. The number of triangles enters network metrics, such as clustering coefficient or transitivity ratio (Newman, 2018), and serves in network description, analysis, and model building. Applications exploiting triangle counts include spam detection (Becchetti et al., 2008), discovering common topics on the web (Eckmann & Moses, 2002), query plan optimization in databases ( Bar-Yossef et al., 2002), community detection (Palla et al., 2005), and others.
Triangle counting has also spurred much theoretical research on suitable counting algorithms for various contexts and goals, for example, to scale well with increasing network size, to adapt to streaming data of edges, to work on networks with restricted access, and many others. The recent review paper (Al Hasan & Dave, 2018) describes well the many developments in this research direction. Algorithms based on various sampling schemes have also been studied actively, especially for large networks, with the goal of estimating the number of triangles space efficiently, even if not quite exactly. See, for example, Jha et al. (2015), Stefani et al. (2017), and the references in Al Hasan & Dave (2018).
In this work, our focus is also on sampling methods and triangles in large networks. But we go beyond the number of triangles as in much of the earlier literature and consider instead the distributions of triangle counts, per both vertex (node) and edge. We are interested in estimating these distributions through suitable sampling schemes. Note that our object of study involves a range of counts (i.e. the number of vertices or edges participating in 0, 1, ... triangles) and not just a single count as the total number of triangles. For this reason perhaps, the considered distributions of triangle counts have thus far received relatively little attention in the literature. The few related exceptions are the works of Stefani et al. (2017) and Lim et al. (2018), which estimate the number of triangles for each vertex in streaming graphs. However, it is yet to be seen whether these estimates translate successfully to obtain the distributions of triangle counts (see Section 2.3 below for further discussion). These distributions are nevertheless informative, showing how triangles are distributed over the network (scattered or concentrated) with respect to vertices and edges. For example, as we discuss below (Section 3.2.3), these distributions might have power-law tails, suggesting the existence of multiple hub-like vertices/edges participating in large numbers of triangles. For instance, in social networks, the number of triangles of a user is used to identify its social role in the network, and provides a metric in assessing the content quality provided by the user.
Our contributions are several fold.
• We propose a new sampling scheme tailored for estimating triangle count distributions per vertex and edge, which we call vertex-induced edge (VIE) sampling. Some available classical sampling schemes such as induced and incident subgraph sampling (e.g. Kolaczyk, 2009, Chapter 5.3.1), often do not carry enough information for proper inference about triangle count distributions; other sampling schemes that seem natural for these distributions can be computationally over-expensive (see Section 2.3 below). VIE sampling balances these issues and is also the scheme that is amenable to theoretical analysis. • We consider three sampling designs of VIE sampling: without replacement, with replacement, and Bernoulli sampling adapted to network access scenarios with full access, restricted access, and streaming edges, respectively. • We develop two estimation methods from a VIE sample: an inversion approach for the bulk of the distribution that is based on a relationship between the true distribution and the sampled distribution and an asymptotic approach for the tail of the distribution that involves an asymptotic equivalence between the two distributions. This follows our and others' earlier work on similar problems in other sampling contexts. • We consider an estimation method based on scaling the number of triangles of the sampled vertices and edges with VIE using multiple samples. A similar approach was used to estimate the numbers of triangles per nodes in streaming data for a unique sample (Lim et al., 2018). The distributions of triangle counts per vertex and edge are then obtained from the empirical distributions of the scaled values. The method will be used as a baseline for comparison with the inversion and asymptotic methods. • We discuss and present algorithms that implement the VIE sampling approach to the situations of full access, restricted access (where we use random walks), and streaming data of edges (where we employ hashing). • We examine our proposed sampling and estimation methods on simulated and real-world networks.
The rest of the paper is organized as follows. Section 2 presents quantities of interest and our sampling framework. Section 3 concerns estimation approaches based on inversion and asymptotics. Section 4 considers the case of multiple samples collected in parallel. Section 5 details our sampling algorithms. Section 6 includes data applications. Finally, Section 7 concludes and discusses directions for future work.
This work is an extended version of our conference paper (Antunes et al., 2020). We expand here the previously proposed VIE sampling approach by allowing three sampling designs. As alluded to above, these accommodate the main network access scenarios. The inversion and The symbol^represents the estimator of the quantity asymptotic estimation methods have been derived for the three sampling designs. Estimation from multiple samples is also presented here for the first time. The sampling algorithms to implement VIE sampling in the several network access scenarios are presented according to the different designs and estimation methods. Finally, we added several numerical results and provided more details and explanation throughout the text.

Quantities of interest and sampling
For ease of reading, Table 1 lists the main mathematical notation used throughout the paper. The rest of this section provides their definitions.

Triangle counts per vertex and edge
We represent a network under study as a graph G = (V, E), where V is the set of vertices (nodes) and E is the set of edges. The graph is assumed to be unweighted, undirected, and without loops or multiple edges (simple). Let N = |V| and M = |E| denote the numbers of vertices and edges, respectively, which are assumed to be known for simplicity. The numbers of triangles of a vertex w ∈ V and an edge (u, v) ∈ E are defined by denote the total number of vertices with j triangles, referred to as triangle count per vertex, where N t = max{T w } is the maximum number of triangles of a vertex in G. The proportion of vertices participating in j triangles is Similarly, let denote the total number of edges with j triangles, referred to as triangle count per edge, where M t = max{T (u,v) } is the maximum number of triangles per edge. The proportion of edges participating in j triangles is See Figure 1 for an illustration of the quantities defined. We are interested in the distribution of triangle counts per vertex and the distribution of triangle counts per edge t e = (t e (0), . . . , t e (M t )) In principle, the largest possible value for M t in a simple network is N − 2 and we assume that in practice, N t and M t may be known, or set large enough.

Sampling framework
Sampling has been used extensively for large complex networks in order to produce network statistics from a portion of the network which, if computed for the full graph, would be prohibitively expensive. When the only source of randomness in the network is the sampling design, the estimation procedures constructed under this assumption are refereed to as design-based procedures. In order to incorporate other sources of randomness into the network, such as the topology of the graph G, one generally specifies a model, leading to the so-called model-based procedures. We focus on design-based procedures in this work. We introduce and focus on a general sampling framework for estimating the distributions of triangle counts of interest. The basic idea follows the work of Buriol et al. (2006), where to estimate the total number of triangles in the network, pairs consisting of a vertex and an edge are sampled to check if they form a triangle. We suppose that there are two separate sets of units being sampled, vertices and edges. Our sampling method can be characterized as having two stages: a selection stage, followed by an observation stage. More precisely, a sample of vertices is selected from V, which yields the set of sampled vertices V * , and a sample of edges is selected from E, resulting in the set of sampled edges E * . The manner in which vertices and edges are sampled can differ. Then, for each sampled vertex w ∈ V * , we count the number of triangles of w formed with the sampled edges (u, v) in E * , by checking if the edges (w, u) and (w, v) exist in the graph. Similarly, for each sampled edge (u, v) ∈ E * , we count the number of triangles of (u, v) formed with the sampled vertices w in V * , by observing if the edges (w, u) and (w, v) belong to E. We call this sampling method the VIE sampling.
We shall consider three main sampling designs of the graph G to obtain V * and E * . In the first design, n vertices and m edges are sampled from V and E uniformly at random, but without replacement. This situation arises naturally in static networks with full access and ensures that all units in V * and E * are distinct. In other contexts, while units are sampled with equal probability, sampling is done with replacement. For example, in static networks with restricted access, random walks are used to crawl the network. These random walks share important proprieties with random sampling with replacement. The standard random walk on vertices (selecting any of their neighbors with equal probabilities) samples (visits) edges uniformly at random with replacement in its stationary regime. By employing the standard Metropolis-Hastings (MH) technique to correct the bias in the standard random walk, vertices can also be sampled (visited) uniformly at random with replacement in the stationary regime. In Bernoulli sampling, each unit, vertex or edge, of the graph is subjected to an independent Bernoulli trial, with probabilities p v or p e , respectively, which determines whether the unit becomes part of the sample sets V * and E * . Sampling conducted in this manner appears in the case of streaming graphs. A random sample size is a potential drawback of Bernoulli sampling; however, this is counterbalanced by the ease of Bernoulli sampling and the benefits of guaranteed bounds on the sample size. Finally, we will also consider a mixed sampling of Bernoulli and with replacement samplings motivated by a random walk algorithm proposed in Section 5. While the considered sampling designs are relatively simple, they are by no means uncommon, and occur under a variety of scenarios described above.
For any of the considered sampling designs, the measured sample analogues of Equations (1) and (2) are and Note that the quantities (7) and (8) are not those in Equations (1) and (2) defined for the graph resulting from the VIE sampled edges and vertices: for example, a triangle formed just by the edges of E * (with no vertices in V * ) would not be counted in either Equation (7) or Equation (8).
The sample triangle count per vertex is where N s t need not be the same as N t and could even be larger as in sampling with replacement. Similarly to Equation (4), let be the proportion of vertices participating in i triangles after sampling and set . The sample triangle count per edge is defined by where M s t depends on the sampling design. Finally, let denote the proportion of edges participating in i triangles after sampling and set t s e = (t s e (0), . . . , t s e (M s t )) . See Figure 2 for an illustration of the quantities defined. Our goal is to estimate t v and t e defined in Section 2.1 from the observed sample quantities (7)-(12).

Comparison with other sampling approaches
When sampling without replacement or with Bernoulli sampling, a subgraph G s = (V s , E s ) can be constructed from the sampled vertices w ∈ V * and edges (u, v) ∈ E * , along with the edges (w, u) and (w, v) if they belong both to E. We would like to contrast this subgraph with other classical approaches like induced or incident subgraph sampling (e.g. Kolaczyk, 2009, Chapter 5.3.1).
For example, with Bernoulli sampling when p v = 0 and p e > 0, observe that only edges are sampled and that G s is the so-called incident subgraph sampling. But we note again that the quantities T s w and T s (u,v) of interest are not triangle counts in the resulting subgraph (since no vertices are sampled when p v = 0, T s w = 0 and T s (u,v) = 0 for all vertices w and edges (u, v) in G s ). In fact, for incident or induced subgraph sampling, if we considered the triangle counts per vertex (in the resulting subgraph), this setting would not be amenable to theoretical analysis, in the sense that we could not relate the count distributions per vertex in the subgraph to those in the original graph (as in Section 3 for the VIE sampling).
We comment here further on our choice of the sampling schemes. For example, for estimating t v , a natural and simple sampling scheme could consist of sampling vertices and then for each sampled vertex, calculating the exact number of triangles (by checking all pairs of its neighbors and seeing how many of these also connect) and then just using the empirical distribution of these as an estimate of t v . This sampling scheme, however, may not be practical, especially for large networks with heavy-tailed degree distributions, since checking all pairs of neighbors becomes prohibitively expensive for vertices with large degrees. Furthermore, this scheme would not work for sampling a streaming graph in one pass which is also an aim of this work. In contrast, with VIE sampling, only the sampled edges in E * with incident vertices neighboring a sampled vertex, are used to count the number of triangles, mitigating the aforementioned computational issue. On the other hand, for estimating t e , note that sampling an edge and then calculating the exact number of triangles is not as big of an issue, particularly in networks showing disassortative mixing, since for each sampled edge, it is less likely that both of its vertices have large degrees. We will see that this case can be emulated with VIE sampling (see Section 3.1.2).
Another possibility, as noted in Section 1, would be to estimate triangle counts for sampled vertices and then translating these to triangle count distributions. This is the basis for the method proposed in Section 4 below, using VIE sampling with multiples samples, which also serves as a baseline that other estimators are compared to (Section 3). We compared our approach with the work (Lim et al., 2018, algorithm Mascot-C, p. 7), where edges are sampled with constant probability and the number of local triangles for each vertex are estimated from the sampled graph. However, the estimation of triangle count distribution per vertex was worse than for VIE sampling using multiple samples (and, in fact, even when using a single sample). This is due in part to the estimation of the local triangles for all vertices in the sampled graph, while in our case, we fixed the sampled vertices and repeat the sampling of the edges reducing the variability (see Section 4).

Estimation from a single sample
It seems natural to estimate simply t v (j) by a "plug-in" value t s v (j). This estimation, however, is biased. This line of reasoning is the focus of research on representative sampling (see e.g. Leskovec & Faloutsos, 2006), that studies how the topological properties of samples differ from those of the original network. Unfortunately, in estimation of the characteristics of our interest, many properties of the network are not captured from samples (unless standard assumptions of samples with i.i.d. observations are possible).
One possible estimation strategy can be based on appropriate adjustments of the "plug-in" estimators to correct the bias. Deriving such corrections for network total counts (e.g. total number of triangles) is often relatively straightforward but this is more challenging for distributions. Furthermore, the performance of the correction depends on the topology of the network, the distribution of interest, the sampling design, and their interactions. In this section, we propose two estimation approaches to correct the bias of the observed quantities using one sample from VIE sampling. The first approach is based on the inversion of an exact relation and the second approach on the asymptotic equivalence between the (expected) sample and true quantities.

Inversion estimation
In this section, we present an approach to estimate t v and t e from sample distributions t s v and t s e . The approach is based on inversion and is common in similar problems (e.g. Antunes & Pipiras, 2016;Tune & Veitch, 2011;Zhang et al., 2015).

Distribution of triangle counts per vertex
Under the considered sampling designs, let P v (i, j) be the probability that a vertex with j triangles in G is observed to have i sampled triangles. If w * is a vertex selected at random after a VIE sampling realization, then conditioning on its number of triangles in G leads to the relation Note also that where the last expectation E is with respect to the randomness in the sampling. The relations (13)- (14) can be combined into a matrix form as ) matrix specific to the sampling design defined below. This suggests that an estimator of the triangle count distribution per vertex can be defined aŝ (15) and its variance(-covariance) matrix is Sampling without replacement If n vertices and m edges are sampled uniformly, without replacement, then with the convention that a b equals to 0 when either a < 0 or a > b. The representation (18) can be explained as follows. Since there are N n possible samples of size n and N−1 n−1 samples may be chosen to include a given vertex, it follows that the probability of a vertex being sampled is n/N. Note that the term j i M−j m−i / M m corresponds to the probability of sampling i triangles out of j (or, equivalently, sampling i edges out of j that form triangles with a vertex in question), which is given by the hypergeometric distribution with parameters j, m, and M. Similarly, 1 − n/N is the probability of not sampling a vertex, for which the number of sample triangles will be 0 with probability 1 and hence the term 1 {i=0} .

Sampling with replacement
In the case when n vertices and m edges are sampled with replacement, the matrix P v has dimension (m + 1) × (N t + 1), with entries Indeed, arguing as for Equation (18), the probability of sampling a vertex is 1 − 1 − 1 M n . The part containing this term in Equation (19) also includes the probability of sampling i triangles which is given by the binomial distribution with parameters m and j/M.

Bernoulli sampling
If each vertex and edge are sampled with probabilities p v and p e , respectively, then P v is a square (N t + 1) × (N t + 1) matrix, with entries which follows by similar arguments as for the other designs. S142 N. Antunes et al.

Distribution of triangle counts per edge
For any of the considered sampling designs, if (u, v) * is an edge selected at random after a VIE sampling realization, the same arguments as for Equations (13)-(14) lead to where P e (i, j) is the probability that an edge with j triangles participates in i sampled triangles. Then, the relation holds, similarly to Equation (15), where P e = (P e (i, j)) is a matrix with dimensions (M s t + 1) × (M t + 1) whose entries depend on the sampling design. The estimator of t e is obtained simply by replacing the expected value by the observed value in Equation (22) and inverting, that is, where P + e = (P e P e ) −1 P e . Following the same reasoning around (16)- (17), the estimator in Equation (23) is unbiased with variance matrix, Sampling without replacement In this design, P e is a ( min (M t , n) + 1) × (M t + 1) matrix that can be obtained from Equation (18)

Sampling with replacement
The matrix P e has dimensions (n + 1) × (M t + 1) and the form (19) by replacing N, n, M, m by M, m, N, n, respectively.

Bernoulli sampling
The matrix P e is (M t + 1) × (M t + 1) and has the form (20) by replacing p v , p e by p e , p v , respectively.

Mixed sampling
We will propose an algorithm in Section 5, which emulates the case of Bernoulli sampling of vertices with p v = 1 and edges with replacement. The corresponding matrix P e is (M t + 1) × (M t + 1) and has nonzero entries This could be seen from Equation (20) with p v and p e replaced by p e = m/M and p v = 1, respectively, or just by arguing directly. It can also be checked that P −1 e has a closed form with the nonzero entries In this case, since only P −1 e (i, i) is nonzero for the ith row of P −1 e (i = 0), the resulting estimator (23) (exceptt e (0)) is just a (scaled) empirical distribution of the true triangle counts per sampled edges.
Remark 1. Our distinction between samplings with and without replacement might appear somewhat restrictive in the following sense. Note that a sampling procedure with replacement can be carried out (e.g. for distribution of triangle counts per vertex) but then only distinct sampled edges be kept for inference under "sampling without replacement, " with the latter referring rather to how collected information is used. In fact, one could develop an analogous inversion approach for such sampling schemes as well, with their own probability matrices P v and P e . But we found the "sampling without replacement" inference for these sampling schemes to be slightly inferior to that with replacement when all sampled information collected without replacement is used. This probably should not be too surprising but might also go against some of the current practices (e.g. Zhang et al., 2015) where inference is made on sample subgraph even if vertices/edges repeat in sampling with replacement as when using a random walk. For the above reason and for simplicity sake, we decided not to include sampling schemes with replacement where only distinct sampled units are kept for inference.

Regularization approach
The performance of the inversion estimators in Equations (16) and (23) can be viewed through the matrix P v and P e condition numbers (i.e. the ratio of the largest to smallest singular values). For instance, the numerical inversion of P v P v in Equations (16) and (17) is governed by its condition number, which is the square of the condition number of P v . If the condition number is small, the matrix is well-conditioned and its inverse can be computed accurately. If the condition number is large, then the matrix is said to be ill-conditioned. The matrices P v and P e are ill-conditioned for smaller values of n and m or p v and p e . This means that computation of the inverse of P v P v is prone to large numerical errors that will produce estimates (16) with an oscillating behavior. Regularization is a common method used to solve ill-posed problems. Since the use of this technique is analogous for both cases, we consider only the triangle count distribution per vertex.
We solve our linear inverse problem from a penalized weighted least squares perspective with nonnegative constraints, where a regularization (penalty) term is added to the objective function and a penalization parameter controls the degree of the regularization. More specifically, the penalized estimatort v is defined as the solution of the optimization problem subject to t(i) ≥ 0, i = 0, 1, . . . , N t , N t i=0 t(i) = 1, where W is a matrix representing suitable weights taken here to be a diagonal matrix with entries t s v , φ(t) refers to the function regularizing t, and λ > 0 is a scalar penalty, that sets the degree of regularization, to be determined separately. When λ = 0, W = I and the constraints are ignored, the minimization (27) yields the inversion estimator (16). For the inversion problems considered in this work, however, regularization with λ > 0 is critical; the inversion estimator with λ = 0 is usually impractical due to a large variance. The regularization with λ > 0 reduces the variance but also introduces some bias, with λ chosen to balance the variance-bias trade-off. The triangle count distributions encountered in practice are usually smooth (i.e. t v (i) ≈ t v (j) for i and j close)-see Section 6. A convenient regularization is the use of the quadratic function that forces estimates to be smooth and significantly reduces their variance. The optimization problem (27)-(28) can be written as a quadratic program and solved for t using standard software. For the implementation details, we refer the reader to Zhang et al. (2015), where regularization is used in a similar context to estimate the degree distribution when sampling vertices and edges. The parameter λ is chosen based on Stein's unbiased risk estimation (SURE) method proposed by Eldar (2009). The use of this method is explained in Section 3.2 of Zhang et al. (2015). The choice of t s v for the diagonal of W also follows Zhang et al. (2015).

Asymptotic estimation
The regularization approach tends to perform poorly at the distribution tail (especially when the latter is heavy-tailed). Estimation in the tail is addressed in this section. We relate the tails of the distributions of triangle counts per vertex and edge with the respective tails of sample distributions. Since the latter is observable, the relation can be used to estimate the former. We also consider the case when the original distribution has a power-law tail.

Distribution of triangle counts per vertex
The sample triangle count of a sampled vertex w * chosen at random from V * after a VIE sampling realization, can be expressed as Note that E1 {(u,v)∈E * } = π e , where π e is the probability of sampling an edge under a given sampling design. Furthermore, since the second sum in Equation (29) is over the triangles including the vertex w * , it has T w * terms. It follows that Since the second term in Equation (30) can be thought as approximately Gaussian with standard deviation of the order √ T w * T w * for large T w * , we expect that for large i. Viewing Equation (31) as a relation for continuous random variables, by differentiation we have Since w * is a sampled vertex, P(T w * = i) = t v (i)π v , where π v is the probability that a vertex is included in the sample for a given sampling design. On the other hand, by Equation (14), P(T s w * = iπ e ) = E(t s v (iπ e )). Then, we can write the asymptotic relation (32) as is a natural estimator of the original distribution tail obtained through the scaling of the empirical sample distribution t s v . If the distribution of triangle counts per vertex has a power-law tail with parameter β, that is, for large i, where β > 0 and c > 0, then Equation (33) implies This means that the sample distribution t s v is expected to have a power-law tail as well with the same parameter β, which can be estimated directly from the empirical sample distribution.

Sampling designs
When random sets of n vertices and m edges are selected from V and E with or without replacement, π v = n/N and π e = m/M. With Bernoulli sampling, we have π v = p v and π e = p e .

Distribution of triangle counts per edge
The sample triangle count of a sampled vertex (u, v) * chosen at random from E * after a VIE sampling realization, is given by Following similar arguments as for Equations (30)-(32), we get for large i. Therefore, we have the relation and an estimator for the distribution tail for large i, through the scaling of the empirical distribution of t s e (i). The probabilities π v , π e depend on the sampling design as discussed at the end of Section 3.2.1.
Additionally, if for large i, where γ > 0 and c > 0, then which also has the same power-law exponent.

Relations between power-law exponents
The reference to and relevance of power-law tails above should not surprise the reader. On one hand, examples of such real networks appear in Section 6 below. On the other hand, such tails are also expected for the following reason. It is well known (e.g. Newman, 2018, Chapter 10) that the degree distributions of real networks can have power-law tails. That is, if D w * denotes the degree of a randomly selected vertex w * , then for large k, where c 0 > 0, α > 1. Furthermore, one commonly finds the clustering coefficient of a vertex w * , that is, T w * / D w * 2 or T w * /D 2 w * to be roughly ξ w * where ξ w * varies over a limited range. Then, T w * ∼ ξ w * D 2 w * and, by conditioning on ξ w * and using Equation (43), for large i. In particular, note that the tail exponent β of the distribution of triangle counts per vertex relates to α as This is also what we typically observe for real networks. Relationship between the tail exponent γ of the distribution of triangle counts per edge and the tail exponent α appears to be more delicate, and will not be discussed here.

Estimation from multiple samples
In this section, we study the feasibility of a single approach to estimate the bulk and the tail of the distribution of triangle counts per vertex and edge, at the cost of using several VIE samples collected in parallel. The main idea is to correct the bias of the sample quantities T s w and T s (u,v) through the use of standard weighted averaging techniques, and then form the empirical distributions using the scaled numbers of triangles of the vertices and edges sampled. Since this can be applied to a more general sampling scenario, we consider that each unit w ∈ V and (u, v) ∈ E can have unequal probability of being included in the sample.

Distribution of triangle counts per vertex
We shall first define estimators of interest for a single sample and then consider their averages from multiple samples. Suppose that, under a given sampling design (without replacement), each edge (u, v) ∈ E has probability π e (u, v) of being included in the sample E * . Then, the Horvitz-Thompson (HT) estimator (see e.g. Thompson, 2012;Tillé, 2006) of the total number of triangles of a sampled vertex w takes the form The estimator (46) is an unbiased estimator of T w , since E1 {(u,v)∈E * } = π e (u, v) and hence If π e ((u, v), (u , v )) denotes the probability that edges (u, v) and (u , v ) are both in the sample E * , then the variance of the HT estimator T w can be expressed as with π e ((u, v), (u , v )) = π e (u, v) for convenience when (u, v) = (u , v ).

Designs with replacement
The HT estimator can also be applied to random sampling with replacement. In this case, the sum in Equation (46) is over the distinct edges in E * and π e (u, v) is the probability of (u, v) being included in the sample E * , which is equal to 1 − (1 − 1/M) m . Similarly, the probability π e ((u, v), (u , v )) that both (u, v) and (u , v ) are included in the sample E * is given by v )) m ). However, in designs with replacement, the so-called Hansen-Hurwitz (HH) estimation (see e.g. Thompson, 2012;Tillé, 2006) is often employed that includes all the sample collected. This is also in line with the analysis conducted in Section 3, where repetitions of sampling units are taken into account. The HH estimator for T w is defined as , the estimator is unbiased as in Equation (47) and its variance can be shown to be In the case of random sampling with replacement, we have π e (u, v) = 1/M which yields The comparison of the variances in Equations (49), (50), and (53) shows that sampling without replacement has the lowest variance for m > 1, followed by Bernoulli sampling if T w > m assuming p e = m/M. For a vertex w ∈ V * , the variance of the estimator T w tends to increase with T w . However, the performance of T w will also be poor for small values of T w . The reason is twofold. First, for small probabilities of sampling edges or when the total number of triangles of a vertex is small, the observed triangles of a sampled vertex will likely be zero and thus so its estimate in Equation (46). On the other hand, when we scale the number of triangles observed by the probability of sampling an edge in Equation (46), there will be gaps between the possible values of the estimates. To improve estimation, we repeat the sampling of edges r times in parallel, increasing the sampling cost, and let It is then natural to set as an estimator for the count T v (i) of vertices with i triangles, where . denotes the "floor" integer part and π v (w) are the vertex inclusion probabilities corresponding to the underlying network sampling design. These will be qual to n/N, for sampling with and without replacement, and p v for Bernoulli sampling. The presence of π v (w) in Equation (55) is to compensate for the fact that the sum in Equation (55) is over w ∈ V * and hence only sampled vertices are included. The final probability estimator in this context takes the form

Distribution of triangle counts per edge
The analysis developed in the previous section holds for the distribution of triangle counts per edge. For instance, the HT estimator of the total number of triangles of a sampled edge (u, v) is By averaging through r samples of vertices obtained in parallel, let T r (u,v) = 1 r r k=1 T (k) (u,v) and from which the estimator is defined as t e (i) = T e (i)/M. Similar expressions for the HH estimator with a replacement design can be derived.
Remark 2. The HT or HH estimation developed above is quite general in its applicability. In fact, it is not limited to VIE sampling and applies when sampling probabilities π v (w) and π e (u, v) are different across w and (u, v). The latter occurs with other common network sampling designs, such as induced or incident subgraph sampling (see Kolaczyk, 2009, Chapter 5.3). However, we note again that these sampling methods are not amenable to the analysis developed in Section 3.

Algorithms
In this section, we show how to implement VIE sampling for several network access scenarios corresponding to the considered sampling designs. For restricted access and streaming graphs, the implementation is described through formal algorithms.

Sampling static graphs with full access
When any vertex or edge can be accessed directly in a static network, VIE sampling can be carried out through a simple algorithm that samples vertices and edges at random without replacement, and then for each sampled vertex in V * counts the number of triangles formed with the sampled edges in E * (and vice versa). Such full access to the network, however, may not be available in other scenarios, for example, when networks can only be crawled or when dealing with streaming edges.

Sampling static graphs with restricted access
Many real-world networks can only be crawled, i.e. one can only explore the neighbors of the visited vertices. In this context, sampling procedures are commonly based on random walks. It is assumed that access to one initial vertex is available and the network is connected or the largest giant connected component covers the majority of the network so that the disconnected parts can be ignored. Two independent random walks could be used to carry out VIE sampling with replacement: for instance, first performing a standard version of random walk sampling (i.e. selecting a vertex uniformly at random among the neighbors of the visited vertex) to sample edges uniformly at random (E * ); and then a random walk to sample vertices uniformly at random (V * ) using the MH algorithm. The number of triangles of each sampled vertex w formed with the sampled edges in E * (and vice versa) can be incremented in each step of the MH algorithm by checking the neighborhood of w. Alternatively, we propose here a scheme with a single random walk on edges, that implements mixed VIE sampling, that samples m edges at random with replacement and emulates p v = 1-see Algorithm 1. Initially, an edge is selected at random and the sample sets V * , E * are empty (line 1). Then, for each sampled edge (u, v) in E * , the number of common neighbors of u and v provides the number of triangles of that edge (line 5). Additionally, each common neighbor of u and v is added to V * and the triangle that forms with the edge (u, v) is increased by one (lines 6-10). Finally, the current vertex is set to v and the next vertex is chosen at random among its neighbors Algorithm 1: Mixed VIE sampling (p v = 1, m) with random walk Data: static graph G; Result: E * , V * , T s w and T s (u,v) representing the next sample edge (lines 11-12), until m edges have been sampled (line 2). After the algorithm is finished, the quantities T s w and T s (u,v) are used to compute Equations (10) and (12), respectively, which then enter into the estimators developed in Sections 3 and 4.
We also analyze Algorithm 1 concerning its processing time per edge and total running time. The main time-intensive operation is to compute the set of common neighbors for two incident nodes of each sampled edge (u, v) (line 5). Let D u denote the degree of node u ∈ V. Assuming that D u ≤ D v , to compute the set Neighbor(u) ∩ Neighbor(v), each element of Neighbor(u) is checked. If Neighbor(v) contains the element, it is added to the set. Thus, the running time for processing a sampled edge is of the order O(min(D u , D v ))). The total running time of the algorithm is O(D * max m), where D * max = max (u,v)∈E * (min(D u , D v )). The algorithm requires memory space to store the sets |V * | and |E * | = m and the respective vertex and edge triangle counts. The value of |V * | depends on the network structure and cannot be computed explicitly.

Sampling streaming graphs
Many real-world networks naturally evolve over time, as new edges/vertices are added to the network. A natural representation of such networks (or streaming graphs) is in the form of a stream of edges. We shall describe how to perform VIE Bernoulli sampling to select a subgraph G s = (V s , E s ) from G, which includes E * and V * , in one pass when G is presented as a stream of edges in no particular order. Two uniform random hash functions (hash v and hash e ) on [0, 1] are used to sample vertices and edges at random with probabilities p v and p e , respectively, which are then added to the sample sets E * and V * -see Algorithm 2 (lines 3-4 and lines 5-6). Additionally, the sampled vertices are also added to V s and edges to G s = (V s , E s ) (lines 7-10). We note that in the streaming scenario if a vertex is sampled, its edges have to be added to the subgraph G s (line 10). This is the cost of having a one pass algorithm over the input stream, in order to able to count the number of triangles of each sampled vertex in V * formed with the sampled edges in E * and vice versa (lines 11-12). The additional edges that do not enter into the VIE sampling can be deleted at the end of the stream but we omit this step in the algorithm. If two passes over the stream of edges are possible, these additional edges need not to be stored. The algorithm also applies to the case of a static graph with full access, as a one pass algorithm through the list of edges.
For Algorithm 2, the processing time per edge of the stream graph is not a time-consuming operation (lines 3-10). Another factor is the triangle count of vertices and edges (lines 11-12); however, compared with Algorithm 1, this is now done on the subgraph G s with lower running time. The algorithm requires memory space to store G s where the expectation of |E s | is given by  T s (u,v) M times the probability of adding an edge (1 − (1 − p e )(1 − p v ) 2 ) (the expectation of |V s | cannot be computed explicitly). If we are interested to guarantee a bounded sample size of n vertices and m edges selected at random without replacement from a streaming graph, we could use reservoir sampling (Vitter, 1985) by keeping n vertices and m edges with the minimum hash values in the reservoir to constructed the subgraph. However, additional edges need also to be added to reservoir to count the triangles for T s w , T s (u,v) . This is more involved and cumbersome to implement, and for simplicity, is not considered here.

Case of multiple samples
The algorithms described above can be adapted easily to obtain multiple samples of vertices or edges in parallel, by using independent random walks or multi-hashing functions for the estimation strategy developed in Section 4.

Data study
In this section, we assess the performance of proposed sampling and estimation methods for the triangle count distributions on synthetic and real networks. These are summarized in Table 2.

A single sample
We first consider the Chung-Lu model (e.g. Newman, 2018, Chapter 10), which has the powerlaw degree distribution P(D w * = k) ∼ ck −2.5 , with N = 20, 000 vertices and M = 350, 000 edges. In Figure 3(a)-(b), it is assumed that the network has restricted access. For Figure 3(a), vertices are sampled with a random walk using the MH version and edges through a standard random walk (as discussed in Section 5). The sample sizes n and m are equal to 20% of the total numbers of vertices and edges, respectively. We note that typical sampling rates to estimate network distributions, e.g. the degree distribution, are in the range of 10%-30% (Zhang et al., 2015). Figure 3(a) shows the true distribution of triangle counts per vertex and its estimate based on the inversion and asymptotic estimation methods developed in Section 3. With the inversion, the penalized estimator (27) allows to recover well only the bulk of the distribution. The penalization has the effect of shrinking the estimates leading to the reduction of the variance and forces some of the estimates in the tail to be zero. However, the estimation in the tail can be recovered through the asymptotic estimation that scales the sample distribution t s v as in Equation (34).
To compare the estimate with the true triangle count distribution, we use the total variation (distance), which has been previously used in graph sampling (e.g. Mohaisen et al., 2012). It is defined as half of the sum of the absolute differences between two probability mass functions, that is, d TV (t v , t v ) = 1 2 i |t v (i) − t v (i)| in our case, and ranges from zero to one. For values of i ≤ k, the bulk of the distributiont(i) is given by the inversion estimation and for i > k by the asymptotic estimation (say, k = 40 in Figure 3(a), where the inversion estimation is worse than the asymptotic one). The values of the total variation are given in the captions of the figures.
We have evaluated the impact on the results when the total numbers of vertices and edges are estimated. Assuming the network and access scenario above, where vertices are sampled at random with replacement (in stationary regime), estimators for N and M are given in Katzir et al. (2011) (Section 4) and Kolaczyk (2009) (Section 5.4.2, Equation (5.27)), respectively. We have run the MH algorithm 10,000 times. The average values of the estimates for the numbers of vertices and edges were 20,035.24 and 350,700.60, while the standard errors were 966.50 and 27,128.67, respectively. The averages of the relative errors were 4% (vertices) and 6% (edges). This shows that the estimates are in fact reasonably close to the true underlying values. We checked the estimation of the distribution of triangle count per vertex assuming an error of 10% in the estimates of N and M, such thatN = 1.1N andM = 1.1M. The comparison with Figure 3(a) showed still an accurate estimation of the bulk and tail of the distribution. The total variation between the estimate and true distribution was 0.24.
For Figure 3(b), the network is sampled using Algorithm 1 (mixed sampling (p v = 1, m)) where m is equal to 20% of the total number of edges. Since the algorithm emulates p v = 1, the matrix P e is not ill-conditioned with the inverse given by Equation (26). In this case, the estimator (23) can be used. Figure 3(b) depicts the estimation of the triangle count distribution per edge when using this inversion. We omit the asymptotic estimation (40) in the plot since it coincides with the inversion estimation from the discussion below Equation (26).
For Figure 3(c)-(d), it assumed that the network is a streaming graph. The synthetic power-law network defined above is converted into a stream of edges taken in random order. The network is then sampled through Algorithm 2 with p v = p e = 0.2. The subgraph G s selected by the algorithm has 48% of edges and 98% of vertices from the original network. If a two pass algorithm is possible to avoid adding unnecessary edges to obtain the VIE sampling quantities, the sampled graph has 29% and 96% of edges and vertices, respectively. There is a cost in this case of storing more 19% of edges with only one pass of the stream of edges which can not be avoided. The higher number of vertices in G s is mainly due to the network property M N and p e = 0.2.
The estimation of the distribution per vertex (plot (c)) using the penalized estimator (27) is slightly more accurate when compared to that in plot (a). The asymptotic estimation (34) performs similar in the tail in plot (c). This is due to the fact that with Algorithm 2, edges and vertices are effectively sampled at random (without replacement), while in Figure 3(a) the random walks used to sample vertices and edges at random (with replacement) do so only in the stationary regime. The estimation of the distribution per edge is given in Figure 3(d) using the penalized estimator (27), since now the matrix P e is ill-conditioned for small values of p v and p e .
We also consider several real-world networks from SNAP database: 1 a collaboration network from the e-print high-energy physics theory (Arxiv HEP-TH) with N = 8, 638 and M = 24, 806; a Facebook social network (gemsec-Facebook) with N = 50, 515 and M = 819, 090; and an Amazon product co-purchasing network (com-Amazon) with N = 334, 863 and M = 925, 872. Figure 4 shows the estimation of the triangle count distributions per vertex and edge for several sampling algorithms dependent on the network access scenario considered, with sampling rates of 20%. The two estimation methods show that the true triangle count distributions can be recovered quite accurately which agrees with the results for synthetic networks.
Finally, we comment on the relations between power-law exponents (Section 3.2.3). For the synthetic network, the estimated exponent of the triangle count distribution per vertex is β = 0.73 (using the maximum likelihood estimation in the formula (10.9) of Newman, 2018) while for the degree distribution, the exponent is α = 1.5 which agrees with Equation (45). For the com-Amazon network, we found β = 1.24 and α = 2.28.

Multiple samples
For the com-Amazon network in a restricted access scenario, we evaluate the estimation method of Section 4. Vertices are sampled using the MH random walk and edges with r independent standard random walks in parallel. Figure 5 shows the estimation of the distribution of triangle counts per vertex for r = 1, 2, 3. The size of each sample is 20% of the total number of the units. With only one sample of edges (r = 1), the distribution is overestimated when using the estimator (56). Increasing the number of samples of edges obtained in parallel, the bias of the estimation decreases, however, at higher sampling cost (with r = 3 effectively corresponding to sampling 60% of the edges). For r = 3, the performance is still worse than when using the inversion and asymptotic estimation with just one sample (Figure 4(c)). Similar findings hold for the estimation of the distribution of triangle counts per edges (Section 4.2).

Conclusions
In this paper, we focused on the triangle count distributions of vertices and edges in networks, and their estimation through a newly introduced sampling method (VIE sampling). Three sampling designs of VIE scheme were proposed motivated by common network access scenarios. We developed an estimation method based on inversion for the bulk of the distribution and an asymptotic method for the tail. We also proposed a single approach to estimate the entire distribution using multiple VIE samples collected in parallel. For the several network access scenarios, algorithms were presented to sample the network using random walks in the restricted access network scenario and using hashing in the setting of streaming edges. The proposed estimation methods were evaluated on several synthetic and real-world networks, showing a satisfactory performance for the inversion and asymptotic approach with a single sample, with a higher cost for the multiple samples approach.
Several open questions were already raised for future work. For example, one open problem concerns the relation between the power-law exponents of the degree distribution and the triangle count distribution per edge (see Section 3.2.3). In other directions, one could possibly consider graphs with repeated edges (multigraphs) or directed graphs, and count distributions per vertex and edge for higher order graphical structures other than triangles such as k-cliques.