mention.tech
Receiving webmentions for everyone
    The content of a post, as raw HTML (or not).
    ", "value": "The content of a post, as raw HTML (or not)." }] }

An img/alt struct, containing the URL of a parsed image under value, and its alt text under alt.

    "properties": {
      "photo": [{
        "value": "",
        "alt": "Example Person"
      }]
    }

A nested microformat data structure, with an additional value key containing a plaintext representation of the data contained within.

    "properties": {
      "author": [{
        "type": ["h-card"],
        "properties": {
          "name": ["Barnaby Walters"]
        },
        "value": "Barnaby Walters"
      }]
    }

All properties may have more than one value. In cases where you expect a single property value (e.g. name), simply take the first one you find, and in cases where you expect multiple values, use all values you consider valid. There are also some cases where it may make sense to use multiple values, but to prioritise one based on some heuristic. For example, an h-card may have multiple url values, in which case the first one is usually the “canonical” URL, and further URLs refer to external profiles.

Let’s look at the implications of each of the potential property value structures in turn.

Firstly, never assume that a property value will be a plaintext string. Microformats publishers can nest microformats, embedded content and img/alt structures in a variety of different ways, and your consuming code should be as flexible as possible. To partially make up for this complexity, you can always rely on the value key of nested structs to provide you with an equivalent plaintext value, regardless of what type of struct you’ve found.
When you start consuming microformats 2, write a function like this, and get into the habit of using it every time you want a single, plaintext value from a property:

    def get_first_plaintext(mf_struct, property_name):
        try:
            first_val = mf_struct['properties'][property_name][0]
            if isinstance(first_val, str):
                return first_val
            else:
                # Any nested struct carries a plaintext equivalent under 'value'
                return first_val['value']
        except (IndexError, KeyError):
            # Property missing, empty, or struct without a 'value' key
            return None

Secondly, never assume that a particular property will contain an embedded HTML struct. This usually applies to content, but is relevant anywhere your application expects embedded HTML. If you want to reliably get a value encoded as raw HTML, then you need to:

1. Check whether the first property value is an embedded HTML struct (i.e. has an html key). If so, take the value of the html key
2. Otherwise, get the first plaintext property value using the approach above, and HTML-escape it
3. If neither is found, the property has no value.

In Python 3.5+, that could look something like this:

    from html import escape

    def get_first_html(mf_struct, property_name):
        try:
            first_val = mf_struct['properties'][property_name][0]
            if isinstance(first_val, dict) and 'html' in first_val:
                return first_val['html']
            else:
                # Fall back to plaintext, escaped so it is safe to use as HTML
                plaintext_val = get_first_plaintext(mf_struct, property_name)
                if plaintext_val is not None:
                    plaintext_val = escape(plaintext_val)
                return plaintext_val
        except (IndexError, KeyError):
            return None

In some cases, it may make sense for your application to be aware of whether a value was parsed as embedded HTML or a plain text string, and to store/treat them differently. In all other cases, always use a function like this when you’re expecting embedded HTML data.

Thirdly, when expecting an image URL, check for an img/alt structure, falling back to the plain text value (and either assuming an empty alt text or inferring an appropriate one, depending on your specific use case).
Something like this could be a good starting point:

    def get_img_alt(mf_struct, property_name):
        try:
            first_val = mf_struct['properties'][property_name][0]
            if isinstance(first_val, dict) and 'alt' in first_val:
                return first_val
            else:
                # Plain value: wrap it in an img/alt struct with empty alt text
                plaintext_val = get_first_plaintext(mf_struct, property_name)
                if plaintext_val is not None:
                    return {'value': plaintext_val, 'alt': ''}
                return None
        except (IndexError, KeyError):
            return None

Finally, in cases where you expect a nested microformat, you might end up getting something else. This is the hardest case to deal with, and the one which depends the most on the specific data and use-case you’re dealing with. For example, if you’re expecting a nested h-card under an author property, but get something else, you could use any of the following approaches:

- If you got a plain string which doesn’t look like a URL, treat it as the name property of an implied h-card structure with no other properties (and if you need a URL, you could potentially take the hostname of the effective URL, if it works in context as a useful fallback value)
- If you got an img/alt struct, you could treat the value as the photo property, the alt as the name property, and potentially even take the hostname of the photo URL to be the implied fallback url property (although that’s pushing it a bit, and in most cases it’s probably better to just leave out the url)
- If you got an embedded HTML struct, take its plaintext value and use one of the first two approaches
- If you got a plain string, check to see if it looks like a URL. If so, fetch that URL and look for a representative h-card to use as the author value
- If you get an embedded mf struct with a url property but no photo, you could fetch the url, look for a representative h-card (more on that in the next section) and see if it has a photo property
- Treat the author property as invalid and run the h-entry (or entire page if relevant) through the authorship algorithm

The first three are general principles which can be applied to many scenarios where you expect an embedded mf struct but find something else. The last three, however, are examples of a common trend in consuming microformats 2 data: for many common use-cases, there are well-thought-through algorithms you can use to interpret data in a standardised way.

Know Your Algorithms and Vocabularies

The authorship algorithm mentioned above is one of several more-or-less formally established algorithms used to solve common problems in indieweb usages of microformats 2. Some others which are worth knowing about include:

- “Who wrote this post?”: authorship algorithm
- “There’s more than one h-card on this page, which one should I use?”: representative h-card
- “I want to get a paginated feed of posts from this page”: How to consume h-feed
- “How do I find and display the main post on this page?”: How to consume h-entry
- “I received a response to one of my posts via webmention, how do I display it?”: How to display comments

Library implementations of these algorithms exist for some languages, although they often deviate slightly from the exact text. See if you can find one which meets your needs, and if not, write your own and share it with the community!

In addition to the formal consumption algorithms, it’s worth looking through the definitions of the microformats vocabularies you’re using (as well as testing with real-world data) and adding support for properties or publishing techniques you might not have thought of the first time around.
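The first three fallback approaches for an unexpected author value, described earlier, could be sketched roughly as follows. The names looks_like_url and implied_author are hypothetical, and the URL check is a deliberately simple assumption. Fetching a URL to find a representative h-card, or running the authorship algorithm, is left to the caller.

```python
from urllib.parse import urlparse

def looks_like_url(text):
    # Simplistic assumption: only absolute http(s) URLs count as URLs.
    parts = urlparse(text.strip())
    return parts.scheme in ('http', 'https') and bool(parts.netloc)

def implied_author(value):
    """Coerce an author property value into an h-card-shaped dict.

    Hypothetical helper. Returns None when nothing usable was found.
    """
    if isinstance(value, dict):
        if 'h-card' in value.get('type', []):
            return value  # already the nested h-card we hoped for
        if 'alt' in value:
            # img/alt struct: image URL becomes photo, alt text becomes name
            props = {}
            if value.get('value'):
                props['photo'] = [value['value']]
            if value['alt']:
                props['name'] = [value['alt']]
            return {'type': ['h-card'], 'properties': props} if props else None
        # Embedded HTML (or another nested struct): use its plaintext value
        value = value.get('value')
    if not isinstance(value, str) or not value.strip():
        return None
    text = value.strip()
    if looks_like_url(text):
        # The caller could fetch this URL and look for a representative h-card
        return {'type': ['h-card'], 'properties': {'url': [text]}}
    # Plain string that is not a URL: treat it as the implied name
    return {'type': ['h-card'], 'properties': {'name': [text]}}
```

Returning an h-card-shaped dict in every successful case means the rest of an application only has to handle one structure, whatever the publisher actually marked up.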
Some examples of vocabulary details to look out for:

- If an h-card has no valid photo, see if there’s a valid logo you can use instead
- When presenting an h-entry with a featured photo, check both the photo property and the featured property, as one or the other might be used in different scenarios
- When dealing with address or location data (e.g. on an h-card, h-entry or h-event), be aware that either might be present in various different forms. Co-ordinates might be separate latitude and longitude properties, a combined plaintext geo property, or an embedded h-geo. Addresses might be separate top-level properties or an embedded h-adr. There are many variations which are totally valid to publish, and your consuming code should be as liberal as possible in what it accepts.
- If an h-entry contains images which are marked up with u-photo within the e-content, they’ll be present both in the content html key and also under the photo property. If your app shows the embedded content HTML rather than using the plaintext version, and also supports photo properties (which may also be present outside the content), you may have to sniff the presence of photos within the content, and either remove them from it or ignore the corresponding photo properties to avoid showing photos twice.

Sanitise, Validate, and Truncate

In the vast majority of cases, consuming microformats 2 data involves handling, storing and potentially re-publishing untrusted and potentially dangerous input data. Preventing XSS and other attacks is out of the scope of the microformats parsing algorithm, so the data your parser gives you is just as dangerous as the original source. You need to take your own measures for sanitising and truncating it so you can store and display it safely.

Covering every possible injection and XSS attack is out of the scope of this article, so I highly recommend referring to the OWASP resources on XSS Prevention, Unicode Attacks and Injection Attacks for more information.
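To make this concrete, here is a minimal sketch of three basic measures using only Python's standard library: checking that a value is an http(s) URL, truncating plaintext at a word boundary, and HTML-escaping plaintext before display. The function names and the 280-character default are assumptions for illustration, and none of this replaces a proper HTML sanitizer for embedded HTML.

```python
from html import escape
from urllib.parse import urlparse

def is_http_url(value):
    """Return True only for absolute http(s) URLs (rejects e.g. javascript:)."""
    if not isinstance(value, str):
        return False
    try:
        parts = urlparse(value.strip())
    except ValueError:
        return False
    return parts.scheme in ('http', 'https') and bool(parts.netloc)

def truncate_words(text, max_length, ellipsis='…'):
    """Cut text to at most max_length characters (plus the ellipsis),
    backing off to the previous space so words are not split.
    This is a simple space-based rule, not a language-aware algorithm."""
    if len(text) <= max_length:
        return text
    cut = text[:max_length]
    if not text[max_length].isspace():
        # The cut landed inside a word: drop the partial word if possible
        head, sep, _ = cut.rpartition(' ')
        if sep:
            cut = head
    return cut.rstrip() + ellipsis

def safe_plaintext(value, max_length=280):
    """Truncate a plaintext value, then escape it for use inside HTML."""
    return escape(truncate_words(value, max_length))
```

Truncating before escaping avoids cutting an entity such as `&amp;` in half, which is why safe_plaintext applies the two steps in that order.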
Beyond the OWASP resources, the following ideas are a good start:

- Use plaintext values where possible, only using embedded HTML when absolutely necessary
- Pass everything (HTML or not) through a well-respected HTML sanitizer such as PHP’s HTML Purifier. Configure it to make sure that embedded HTML can’t interfere with your own markup or CSS. It probably shouldn’t contain any JavaScript ever, either.
- In any case where you’re expecting a value with a specific format, validate it as appropriate. More specifically, everywhere that you expect a URL, check that what you got was actually a URL. If you’re using the URL as an image, consider fetching it and checking its content type
- Consider either proxying resources such as images, or storing local copies of them (reducing size and resolution as necessary), to avoid mixed content issues, potential attacks, and missing images if the links break in the future.
- Decide on relevant maximum length values for each separate piece of external content, and truncate them as necessary. Ideally, use a language-aware truncation algorithm to avoid breaking words apart. When the content of a post is truncated, consider adding a “Read More” link for convenience.

Test with Real-World Data

The web is a diverse place, and microformats are a flexible, permissive method of marking up structured data. There are often several different yet perfectly valid ways to achieve the same goal, and as a good consumer of mf2 data, your application should strive to accept as many of them as possible!

The best way to test this is with real-world data. If your application is built with a particular source of data in mind, then start off with testing it against that. If you want to be able to handle a wider variety of sources, the best way is to determine what vocabularies and publishing use-cases your application consumes, and look at the Examples sections of the relevant indieweb.org wiki pages for real-world sites to test your code against.
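One way to put this into practice is to collect differently shaped versions of the same property as fixtures and check that your helper copes with all of them. The sketch below repeats the get_first_plaintext function from earlier so that it runs on its own. The fixtures and the run_fixtures name are hand-written stand-ins, and in practice you would replace them with parsed output from real sites.

```python
def get_first_plaintext(mf_struct, property_name):
    try:
        first_val = mf_struct['properties'][property_name][0]
        if isinstance(first_val, str):
            return first_val
        return first_val['value']
    except (IndexError, KeyError):
        return None

# (label, parsed structure, expected plaintext author) triples.
# All values are invented examples, not data from real sites.
FIXTURES = [
    ('plain string',
     {'properties': {'author': ['Example Person']}},
     'Example Person'),
    ('embedded HTML struct',
     {'properties': {'author': [
         {'html': '<a href="https://example.com/">Example Person</a>',
          'value': 'Example Person'}]}},
     'Example Person'),
    ('img/alt struct',
     {'properties': {'author': [
         {'value': 'https://example.com/photo.jpg', 'alt': 'Example Person'}]}},
     'https://example.com/photo.jpg'),
    ('nested h-card',
     {'properties': {'author': [
         {'type': ['h-card'],
          'properties': {'name': ['Example Person']},
          'value': 'Example Person'}]}},
     'Example Person'),
    ('empty property', {'properties': {'author': []}}, None),
    ('missing property', {'properties': {}}, None),
]

def run_fixtures():
    """Return the labels of fixtures whose result differs from the expectation."""
    return [label for label, struct, expected in FIXTURES
            if get_first_plaintext(struct, 'author') != expected]
```

An empty list from run_fixtures() means every shape was handled as expected; any label it returns points at a publishing variation the helper does not yet cope with.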
Don’t forget to test your code against examples you’ve published on your own personal site!

Next Steps

Hopefully this article helped you avoid a lot of common gotchas, and gave you a good head-start towards successfully consuming real-world microformats 2 data. If you have questions or issues, or want to share something cool you’ve built, come and join us in the indieweb chat room.

in reply to: @aaronpk
Trying out this guide to sending webmentions
Go ahead and copy that HTML and save it into a new file on your web server, for example: aaronpk.com/reply.html. Take your new post's URL and paste it into the webmention form at the bottom of this post. After a few seconds, reload this page and you should see your post show up under "Other Mentions"!

Making it look better

That's a great start! But you might be wondering where your comment text is. To make your comment show up better on other people's websites, you'll need to add a little bit of HTML markup to tell the site where your comment text is and to add your name and photo. Let's take the HTML from before and add a couple of pieces.

in reply to: @aaronpk
Trying out this guide to sending webmentions