
Storing copies of past gsoc proposals

Suggested by Roger in case the originals vanish. These are pdf printoffs
with the exception of hbock's DNSEL rewrite app, which didn't translate well.
This was a simple page, so just copying it.

I'm also replacing a simple txt copy of jvoisin's metadata toolkit proposal
with a much better looking pdf of his melange app (restricted to just the
content via firebug).

Damian Johnson authored on 14/05/2011 21:14:00
Showing 1 changed file
new file mode 100644
@@ -0,0 +1,450 @@
1
+<html><head>
2
+<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
3
+  <title>GSoC Application</title>
4
+     <style type="text/css">
5
+       body {
6
+         width:60%;
7
+         text-align:justify;
8
+       }
9
+     </style>
10
+  </head><body>
11
+<h2><a href="http://torproject.org/">Tor Project</a> - DNSEL Rewrite</h2>
12
+<b>Google Summer of Code Student Application</b> <br>
13
+Harry Bock &lt;hbock AT ele DOT uri DOT edu&gt;
14
+<blockquote>
15
+<h2>Abstract:</h2>
16
+
17
+<p>
18
+  The TorDNSEL project is concerned with identifying individual hosts
19
+  as valid and accessible Tor exit relays.  Each Tor exit relay has an
20
+  associated exit policy governing what traffic may leave the Tor
21
+  circuit and go out as requests to the internet.  A public database
22
+  that can be easily queried or scraped would be of huge benefit to
23
+  the Tor community and to services that are interested in whether
24
+  clients originate from the Tor network, such as Wikipedia and IRC
25
+  networks.
26
+</p>
27
+</blockquote>
28
+
29
+<ol>
30
+<li>
31
+  <strong>
32
+    What project would you like to work on? Use our ideas lists as a
33
+    starting point or make up your own idea. Your proposal should
34
+    include high-level descriptions of what you're going to do, with
35
+    more details about the parts you expect to be tricky. Your
36
+    proposal should also try to break down the project into tasks of a
37
+    fairly fine granularity, and convince us you have a plan for
38
+    finishing it.
39
+  </strong>
40
+  <p>
41
+    My primary interest is in the TorDNSEL rewrite.  The current
42
+    implementation is unmaintained and written in Haskell; I would like
43
+    to rework it from the ground up using Python and the TorFlow interface.
44
+  </p>
45
+  <ul>
46
+  <li>
47
+    Prior to actually building the new TorDNSEL, some time must be
48
+    spent researching and testing strategies for identifying and verifying
49
+    recent Tor exit relays.
50
+    <ul>
51
+      <li>
52
+	Much of this information is available straight from the
53
+	cached-descriptors file distributed to running Tor relays, but
54
+	not all of it is accurate or up-to-date.
55
+      </li>
56
+      <li>
57
+	<p>
58
+	  To ensure the honesty of advertised exit nodes, the program
59
+	  must actively build circuits over the Tor network to the
60
+	  intended exit node and verify the IP address and exit policies
61
+	  listed in the cached router list.  This will be accomplished via
62
+	  TorFlow, most likely using the NetworkScanners tool SoaT.  If
63
+	  more detailed checking is required than provided by SoaT, it will
64
+	  be modified and extended to suit the new requirements.
65
+	</p>
66
+      </li>
67
+      <li>
68
+	<p>
69
+	  Since new relays entering the Tor network are almost
70
+	  immediately available for use, it is important that new
71
+	  relays are checked and added as quickly as possible.
72
+	  Testing ExitPolicy honesty may be time-consuming for
73
+	  certain relays, destinations, and services.
74
+	</p>
75
+	<p>
76
+	  To improve exit honesty checking latency, hosts that have
77
+	  complex exit policies may be checked incrementally; for
78
+	  example, if popular services such as http/https/domain/ssh
79
+	  are allowed by the relay's ExitPolicy, these should be
80
+	  checked first and the relay can be marked as "honest
81
+	  (preliminary)", so that partial results may be listed
82
+	  earlier pending more thorough circuit testing.
83
+	</p>
84
+      </li>
85
+    </ul>
86
+  </li><li>
87
+    Once the desired method for compiling and verifying exit relays
88
+    is tested, a formal design specification for the following operations
89
+    must be compiled:
90
+    <ol>
91
+      <li><p>Scraping of all known exit relays and their exit
92
+	policies.</p>  As noted in the Tor dir-spec v3, in the future
93
+	it will not scale to have every Tor client/directory cache know
94
+	the IP of every other router.  We need to be able to accurately
95
+	obtain this data, up-to-date and in bulk, from an authoritative
96
+	source.  The DNSEL should be able to verify all exit relays,
97
+	in the shortest time span allowable.
98
+      </li>
99
+      <li>
100
+	Active testing for valid exit IP addresses and policies.
101
+      </li>
102
+      <li>
103
+	Parameters and raw format of the resulting data.
104
+      </li>
105
+      <li>
106
+	Formal mechanisms and formats for access by consumers of this
107
+	data.  As a minimal starting point, SQLite and JSON should be
108
+	supported for bulk data sets, as both are well-structured,
109
+	standardised formats that have cross-platform open-source
110
+	support for the most widely used programming languages. Once this
111
+	data is readily available, it would be extremely useful to provide
112
+	a simple API in Python to improve ease-of-use and integration of
113
+	this data with existing services.
114
+      </li>
115
+    </ol>
116
+  </li>
117
+  <li>
118
+    <p>
119
+      Currently, users can check the TorBulkExitList on
120
+      check.torproject.org or perform DNS queries against
121
+      exitlist.torproject.org, but this is not ideal for all consumers
122
+      of this data; currently it is expensive to perform these queries
123
+      and they must be done one at a time over the network. While this
124
+      is less of an issue for services that need to perform infrequent
125
+      exit checks, I propose that this mechanism can actually harm
126
+      anonymity, as an adversary that can track these queries (those
127
+      with access to the network of the querying service, or those
128
+      running rogue DNSEL implementations, should it become more
129
+      distributed) can determine, for example, which exit nodes are
130
+      currently serving a large number of IRC connections.  This is
131
+      more important for services that do frequent queries on a wide
132
+      range of IP addresses.  By encouraging these queries to be done
133
+      locally, we can improve network latency, throughput, and
134
+      anonymity together.
135
+    </p>
136
+    <p>
137
+      On the other hand, for some services, requiring them to
138
+      download the entire exit set when they only need to query a few
139
+      addresses daily would be a waste of valuable resources on both
140
+      ends.  Thus, both single-exit queries and bulk exit description
141
+      lists must be provided; raw queries would be used by services
142
+      that are simply doing manual checks of possible Tor addresses,
143
+      or infrequent automated checks.
144
+    </p>
145
+    <p>
146
+      The primary consumers of bulk data are services that need
147
+      to do frequent automated checks and benefit strongly from
148
+      local caching of data. Some prominent examples would be:
149
+      </p><ul>
150
+	<li>The Tor project itself (see TorBulkExitList.py,
151
+	  check.torproject.org, etc.).</li>
152
+	<li>IRC networks such as Freenode and OFTC that
153
+	  automatically scrub any potentially identifiable
154
+	  information from WHOIS queries.</li>
155
+	<li>Social networks or collaborative communities such as
156
+	  Wikipedia that would benefit from knowing if a
157
+	  particular IP address is shared, allowing them to
158
+	  (possibly) be more lenient towards abuse from
159
+	  anonymizing services.
160
+	</li>
161
+      </ul>
162
+    <p></p>
163
+    <p>
164
+      A particularly important goal is to ensure that Tor users
165
+      that also happen to run Tor relays are not automatically
166
+      blocked by services simply because they are relays or exit
167
+      nodes. If a service is able to ascertain that an IP address
168
+      corresponds to a Tor relay, but its exit policy would not
169
+      allow traffic to access the service from Tor, ideally it
170
+      should not block access to its resources.  The cheaper and
171
+      easier it is for a service operator to validate this kind of
172
+      information, the more likely the service is to use it and
173
+      collaborate with the Tor community.  The more the service is
174
+      used, the more likely we are to get feedback, whether
175
+      it be about the data format, false positives or negatives,
176
+      invalid/incorrect exit policies, etc.
177
+    </p>
178
+  </li>
179
+  <li>
180
+    The TorDNSEL project itself will likely benefit from being broken
181
+    into several components that interact with each other and operate on
182
+    exit list data sets.
183
+    <ul>
184
+      <li>Data scraper and honesty checking daemon: constantly runs in
185
+      the background, fetching new information about Tor relays and
186
+      their exit policies and updating the raw journal.  As new relays
187
+      come in, they are accessed and verified for honesty and
188
+      correctness.  As relay data becomes stale according to a
189
+      configurable TTL, the affected relays are re-checked.
190
+      </li>
191
+      <li>
192
+	Data compiler: Manages, updates, and merges exit lists.  Can
193
+	import and export data from bulk formats and provide
194
+	statistics about data sets.
195
+      </li>
196
+      <li>
197
+	Single-query engine: Allows applications and external services
198
+	to query about a single IP address and exit policy at a time.
199
+	Can interface via several convenient methods, starting out
200
+	simple (e.g., Python API requests or HTTP requests) and
201
+	eventually working up to RBL-style DNS queries, like the
202
+	original DNSEL, should time permit.
203
+      </li>
204
+      <li>
205
+	Data distribution: How do clients actually access bulk data?
206
+	This doesn't necessarily have to be its own application, but
207
+	it must be feasible for clients to easily scrape and update
208
+	their local caches.  The data may be served by a static web
209
+	server, like lighttpd, or via other mechanisms (sftp, or dns
210
+	zone transfer as a more convoluted example).
211
+      </li>
212
+    </ul>
213
+  </li>
214
+  <li>
215
+    As an aside, there are several open and accepted Tor proposals
216
+    that are relevant to the work I will complete:
217
+    <ul>
218
+      <li>140 <b>Provide diffs between consensuses</b> (ACCEPTED)</li>
219
+      <li>146 <b>Add new flag to reflect long-term stability</b> (OPEN)</li>
220
+      <li>147 <b>Eliminate the need for v2 directories in generating v3 directories</b> (ACCEPTED)</li>
221
+      <li>159 <b>Exit Scanning</b> (OPEN)</li>
222
+    </ul>
223
+  </li>
224
+
225
+  <!--
226
+      <li>
227
+    From what I have researched and discussed so far, the tricky parts
228
+    will be verifying that listed exit nodes and their exit policies
229
+    are correct (as advertised) and presenting this data in a useful
230
+    way to dnsel consumers.  It was originally assumed that consumers
231
+    of dnsel records would want to use a DNS RBL-style interface, and
232
+    while this interface has been useful to (for example) IRC server
233
+    operators, these direct queries end up revealing more about the
234
+    usage of exit nodes than is necessary.
235
+  </li>
236
+      -->
237
+</ul>
238
+<p></p>
239
+<b>Estimated timeline:</b>
240
+<p>For each weekly milestone, appropriate documentation should be written to
241
+coincide with the completed work.  For example, end-user tools should have
242
+manual pages at the very least, and preferably include a LaTeX manual.  Milestones
243
+that are primarily experimental in nature should include complete descriptions and
244
+proposals in plain-text where appropriate.  All source code will be thoroughly
245
+commented and include documentation useful to developers.
246
+</p> 
247
+<p>
248
+  <b>April 26 - May 24 (Pre-SoC)</b>: Get up to speed with Tor
249
+  directory and caching architecture, pick apart existing Haskell
250
+  implementation of TorDNSEL, and master TorFlow.
251
+</p>
252
+<p>
253
+  <b>May 31 (end of week 1)</b>: Have a working mechanism for
254
+  compiling as much testable information about exit relays as
255
+  possible. This data must be easily accessible for subsequent work.
256
+  This may be taken, adapted, or abstracted from existing data directory crawling
257
+  in TorFlow.
258
+</p>
259
+<p>
260
+  <b>June 14 (week 3)</b>: Working implementation of tests using TorFlow, especially
261
+  ExitAuthority tools.  This will probably be the most time-consuming period; may take
262
+  up to a week more than anticipated.
263
+</p>
264
+<p>
265
+  <b>June 21 (week 4)</b>: Be able to produce consistent, constantly updating exit lists
266
+  with tested and untested exit policies listed.  Find Tor developer guinea pigs to
267
+  test and hunt for glaring holes in exit relay honesty testing and verification. :)
268
+</p>
269
+<p>
270
+  <b>June 28 (week 5)</b>: Begin proof-of-concept production of bulk
271
+  data formats (raw, SQLite and JSON), all of which should be similar
272
+  in format.  Consultations should be made with consumers of such data
273
+  (Freenode, Wikipedia, etc.) to ensure the current data presentation
274
+  is not overreaching or missing information that would be useful to
275
+  them.  
276
+</p>
277
+<p>
278
+  <b>July 12 (week 7)</b>: Integrate existing functionality and data access methods into
279
+  a Python API that is usable for consumers and the DNSEL application itself.  Style should
280
+  be similar to TorFlow where possible.
281
+</p><p>
282
+  <b>July 16 (midterm evals):</b> Completed specifications of TorDNSEL
283
+  operation, basic data formats, and delivery methods.  Completed
284
+  first proof-of-concept implementation.  First major review with
285
+  mentors and as much of the Tor developer community as time permits.
286
+</p>
287
+<p>
288
+  <b>July 19 (week 8)</b>: Work on designing and testing exit list cache update mechanisms.
289
+  Start with something similar to cached-descriptors.new journaling, and work up
290
+  to something useful for other data formats.  Integrate the mechanism into the API.
291
+</p>
292
+<p>
293
+  <b>July 26 (week 9)</b>: Solidify main scrape/check application and
294
+  perform as much real-world testing as time permits, adjusting for
295
+  major setbacks, if any.
296
+</p>
297
+<p>
298
+  <b>August 2 (week 10)</b>: Make adjustments based on feedback from
299
+  (hopefully) several real-world consumers of TorDNSEL data.
300
+  Generally polish and improve usability of core application(s).
301
+</p>
302
+<p>
303
+  <b>August 9 ("pencils down"):</b> Start pumping out documentation and comprehensive
304
+  code review.
305
+</p>
306
+<p>
307
+  <b>August 16 ("okay really, pencils down"):</b> Major remaining kinks
308
+  should be ironed out; polish specification and documentation and
309
+  begin writing final evaluations. Plan for future maintenance of
310
+  TorDNSEL.
311
+</p>
312
+</li>
313
+
314
+<li>
315
+  <strong> Point us to a code sample: something good and clean to
316
+  demonstrate that you know what you're doing, ideally from an
317
+  existing project.
318
+  </strong>
319
+  <p>
320
+    Code from almost any project I've worked on is available at
321
+    http://git.spanning-tree.org/.  Some of my better code:
322
+    </p><ul>
323
+      <li><a href="http://git.spanning-tree.org/index.cgi/grizzlor/">libgrizzlor</a> [C, Common Lisp], an abstraction layer for the
324
+	SILC client library focused on bots.
325
+      </li>
326
+      <li><a href="http://git.spanning-tree.org/index.cgi/rigel/tree/">rigel</a> [C], a UNIX PIC16/PIC18 program loader for use with the FIRST robotics competition.</li>
327
+      <li><a href="http://git.spanning-tree.org/index.cgi/nis/">Network Subsystem Inventory</a> [Python], a Django web application for keeping an inventory of network resources on a large university network. </li>
328
+      <li><a href="http://git.spanning-tree.org/index.cgi/Periscope/">Periscope</a>
329
+ [Common Lisp], a network monitoring application inspired by IP-Audit, 
330
+designed from the ground up to work with the Argus netflow application.</li>
331
+    </ul>
332
+  <p></p>
333
+</li>
334
+<li><strong> Why do you want to work with The Tor Project / EFF in particular?</strong>
335
+<p>
336
+The Tor Project interests me primarily from architectural and
337
+information security perspectives; my primary focus in information
338
+security has always been authentication and authorization - verifying
339
+the identity of a user to explicitly or implicitly control access to
340
+machine and network resources.  The goal of all forms of public-key
341
+and secure hash cryptography is the authentication of a third party or
342
+data, essentially pinning their identity down.
343
+</p>
344
+<p>
345
+Tor greatly interests me because it has the opposite goal; it tries to
346
+ensure that pinning down the identity of any particular user is
347
+(ideally) impossible or at least greatly hindered for any non-global
348
+adversary.  Protecting the rights of network users by preserving their
349
+anonymity is an incredibly important and complicated goal, and Tor's
350
+role in increasing anonymity of internet access in the face of many
351
+types of adversaries is extremely valuable.  To this end, I hope that
352
+my contributions will be found useful by the Tor project, its users,
353
+and those working to protect these end users.
354
+</p>
355
+</li>
356
+<li>
357
+  <strong>
358
+    Tell us about your experiences in free software development
359
+    environments. We especially want to hear examples of how you have
360
+    collaborated with others rather than just working on a project by
361
+    yourself.
362
+  </strong>
363
+    <p>While nearly all of the projects I've worked on have been free
364
+    software, my experience working directly with the free software
365
+    community at large is minimal.  I have contributed briefly to the
366
+    KDE project, working on their display configuration application,
367
+    and submitted patches to other open source projects (QoSient's
368
+    Argus netflow tools and Google's ipaddr-py, for example).  I have
369
+    collaborated with various universities in New England on
370
+    development of the Nautilus project (http://nautilus.oshean.org/)
371
+    and its main subproject, Periscope
372
+    (http://nautilus.oshean.org/wiki/Periscope), while working at the
373
+    OSHEAN non-profit consortium. </p>
374
+    <p>
375
+I sincerely look forward to working with the vibrant development
376
+community of the Tor project and hope to gain more experience in
377
+collaborating with an experienced group of developers.
378
+</p>
379
+</li>
380
+
381
+<li>
382
+  <strong>
383
+    Will you be working full-time on the project for the summer, or
384
+    will you have other commitments too (a second job, classes, etc)?
385
+    If you won't be available full-time, please explain, and list
386
+    timing if you know them for other major deadlines
387
+    (e.g. exams). Having other activities isn't a deal-breaker, but we
388
+    don't want to be surprised.
389
+  </strong>
390
+  <p>
391
+    I will be working part-time at the University of Rhode Island
392
+    Information Security Office, and will have one summer class for five
393
+    weeks starting in late May.  I don't anticipate either will
394
+    significantly affect my involvement with the Tor project.
395
+  </p>
396
+</li>
397
+<li>
398
+  <strong>
399
+    Will your project need more work and/or maintenance after the
400
+    summer ends? What are the chances you will stick around and help
401
+    out with that and other related projects?
402
+  </strong>
403
+<p>
404
+While I am confident I can produce a working initial implementation of
405
+dnsel in the time allotted, I anticipate it will need more work at the
406
+end of summer.  One of my primary goals for the dnsel project is to
407
+make it easier to maintain, as its operation will have to be adjusted
408
+to fit with changes in the Tor architecture.  Making the project more
409
+accessible to other maintainers will allow for greater collaboration
410
+and improvements to dnsel where development on the current
411
+implementation has stagnated.
412
+</p>
413
+</li>
414
+<li>
415
+  <strong>
416
+    What is your ideal approach to keeping everybody informed of your
417
+    progress, problems, and questions over the course of the project? Said
418
+    another way, how much of a "manager" will you need your mentor to be?
419
+  </strong>
420
+  <p>
421
+    I will do my best to communicate with my mentors and the Tor developer
422
+    community at large as frequently and directly as possible, via
423
+    #tor-dev and the mailing lists.  I also hope to inform others of more
424
+    major milestones in the project via a blog or web page, and keep
425
+    detailed documentation and progress updates on the Tor wiki.
426
+  </p>
427
+</li><li>
428
+  <strong>What school are you attending? What year are you, and what's 
429
+your major/degree/focus? If you're part of a research group, which one?</strong>
430
+  <p>
431
+    I am currently attending the University of Rhode Island.  This is my
432
+ fourth year in college and second at URI; I am a Computer Engineering 
433
+major, intending to graduate next year and obtain my master's degree the 
434
+following year.  My primary interests are low-level software development
435
+ and systems programming, networking, information security, and signal 
436
+processing.
437
+  </p>
438
+</li>
439
+<li>
440
+  <strong>How can we contact you to ask you further questions? Google 
441
+doesn't share your contact details with us automatically, so you should 
442
+include that in your application. In addition, what's your IRC nickname?
443
+ Interacting with us on IRC will help us get to know you, and help you 
444
+get to know our community.</strong>
445
+  <p>
446
+    You can contact me at hbock@ele.uri.edu; my nickname on IRC is <b>hbock</b>.
447
+  </p>
448
+</li>
449
+</ol>
450
+</body></html>
\ No newline at end of file
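
The proposal's incremental honesty check (popular services first, then the rest of the ExitPolicy) and its TTL-based re-check could be sketched roughly as below. This is only an illustration under stated assumptions: `RelayRecord`, `check_order`, `check_relay` and `is_stale` are hypothetical names, not part of TorDNSEL or TorFlow, and `probe` is a placeholder for whatever circuit test the real scanner would run.

```python
import time
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Optional

# Standard port numbers for the "popular services" the proposal names
# (http, https, domain, ssh), in the order they are probed here.
POPULAR_PORTS = (80, 443, 53, 22)


@dataclass
class RelayRecord:
    """Hypothetical journal entry for one exit relay (illustrative only)."""
    address: str
    allowed_ports: FrozenSet[int]
    status: str = "untested"
    last_checked: Optional[float] = None


def check_order(allowed_ports):
    """Split the allowed ports into (popular, rest); popular ones go first."""
    popular = [p for p in POPULAR_PORTS if p in allowed_ports]
    rest = sorted(p for p in allowed_ports if p not in POPULAR_PORTS)
    return popular, rest


def check_relay(record: RelayRecord,
                probe: Callable[[str, int], bool],
                now: Optional[float] = None) -> List[str]:
    """Probe a relay in two phases and return the statuses it passed through.

    `probe(address, port)` is an assumed callback that returns True when
    traffic to `port` really exits from `address`.
    """
    popular, rest = check_order(record.allowed_ports)
    history = []
    ok = all(probe(record.address, port) for port in popular)
    if ok and popular and rest:
        # Popular services verified: a partial result can be listed early.
        history.append("honest (preliminary)")
    # Short-circuits, so the remaining ports are skipped after a failure.
    ok = ok and all(probe(record.address, port) for port in rest)
    history.append("honest" if ok else "dishonest")
    record.status = history[-1]
    record.last_checked = time.time() if now is None else now
    return history


def is_stale(record: RelayRecord, ttl_seconds: float,
             now: Optional[float] = None) -> bool:
    """A relay needs re-checking if never checked or older than the TTL."""
    if record.last_checked is None:
        return True
    now = time.time() if now is None else now
    return now - record.last_checked >= ttl_seconds
```

A scanning daemon built along these lines would loop over its journal, call `check_relay` on every record for which `is_stale` is true, and publish each intermediate status as it appears.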
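
The proposal names SQLite and JSON as the minimal bulk formats and wants them to be similar in layout. A rough sketch of exporting one exit list to both, using only the Python standard library, might look like this. The record fields, table names and the `to_json`, `to_sqlite` and `allows` helpers are assumptions made for illustration; the proposal leaves the actual schema open.

```python
import json
import sqlite3

# Hypothetical exit records; the addresses are documentation-only ranges.
EXITS = [
    {"address": "192.0.2.1", "ports": [22, 80, 443], "status": "honest"},
    {"address": "198.51.100.7", "ports": [6667],
     "status": "honest (preliminary)"},
]


def to_json(exits):
    """Serialise the exit list as one JSON document."""
    return json.dumps({"exits": exits}, sort_keys=True)


def to_sqlite(exits, path=":memory:"):
    """Write the same records into an SQLite database and return the connection."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS exits ("
        "address TEXT PRIMARY KEY, status TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS exit_ports ("
        "address TEXT NOT NULL REFERENCES exits(address), "
        "port INTEGER NOT NULL, PRIMARY KEY (address, port))")
    with conn:  # commits on success, rolls back on error
        for entry in exits:
            conn.execute("INSERT OR REPLACE INTO exits VALUES (?, ?)",
                         (entry["address"], entry["status"]))
            conn.executemany(
                "INSERT OR REPLACE INTO exit_ports VALUES (?, ?)",
                [(entry["address"], port) for port in entry["ports"]])
    return conn


def allows(conn, address, port):
    """Local single-exit query: does this listed exit allow the port?"""
    row = conn.execute(
        "SELECT 1 FROM exit_ports WHERE address = ? AND port = ?",
        (address, port)).fetchone()
    return row is not None
```

A consumer holding a local copy of the SQLite file can answer questions such as `allows(conn, "192.0.2.1", 443)` without sending any query over the network, which is the latency and anonymity benefit the proposal argues for.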