Module 0 FAQ
Image recognition for interpretation of medical X-rays.
The inability to provide explanations is a weakness in today’s AI systems, and it’s one of the reasons for the progression to the current Third Wave of AI, where the weakness is directly addressed. Many research programs are now turning to the question of explainability and working toward AI systems that provide accompanying explanations for how a conclusion was reached rather than just a single conclusive output. This problem has yet to be solved, and to do so, context must be considered, so contextual reasoning is a major thrust of the Third Wave of AI.
There are many motivations for designing an AI to engage with a hacker, including goals such as “waste the attacker’s time to keep them from attacking other people” or even “unveil the attacker.” More to come on this topic in Modules 7 and 8.
Module 1 FAQ
AI solutions are often applied in situations where the input is unknown or where the problem requires prediction. A running joke back in the early days of AI was the notion that, when a solution was developed for a very hard problem, the problem was no longer considered AI.
Let’s take chess as an example. As it turns out, brute-force chess actually works quite well. Does that mean “brute force” could actually be considered intelligence? Maybe not, but AI techniques may still be incorporated into the solution, because the solution space can become so huge (especially with brute-force techniques) that it is not practical to expand all possible moves from the beginning to the end of a game.
Heuristics are a concept often applied in AI solutions to prune the search space. (We’ll learn more about this in Module 2.) Technically the result is an “approximate solution,” not necessarily a complete or optimal solution as one might find in non-AI problems. So even for games like chess, application of AI techniques has relevance. The problem is not fully “solved” without quite intensive resources, although it can be approximately solved by applying heuristics.
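To make the idea of pruning with a heuristic concrete, here is a minimal sketch on a toy game rather than chess. The game (players alternately take 1 to 3 stones, and whoever takes the last stone wins), the function names, and the rule-of-thumb score are all assumptions chosen for illustration. The search stops expanding after a fixed number of moves and falls back on the heuristic instead of playing every game to the end.

```python
def heuristic(stones):
    """Rule-of-thumb score for the player about to move (+1 good, -1 bad)."""
    return -1 if stones % 4 == 0 else 1


def negamax(stones, depth, counter):
    """Score a position for the player to move, expanding at most `depth` plies."""
    counter[0] += 1  # count every position we look at
    if stones == 0:
        return -1  # the opponent just took the last stone, so the mover lost
    if depth == 0:
        return heuristic(stones)  # stop expanding and estimate instead
    return max(-negamax(stones - take, depth - 1, counter)
               for take in (1, 2, 3) if take <= stones)


full_nodes, limited_nodes = [0], [0]
exact = negamax(15, 15, full_nodes)       # deep enough to reach every game end
estimate = negamax(15, 3, limited_nodes)  # cut off after three plies
print("full search:   ", exact, "positions visited:", full_nodes[0])
print("depth-limited: ", estimate, "positions visited:", limited_nodes[0])
```

In this toy game the rule of thumb happens to agree with the full search, so nothing is lost by cutting off early. In a game like chess the evaluation at the cutoff is only an estimate, which is why the result is an approximate solution.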
When we refer to the “tradeoff between size of model and precision,” we mean it’s possible that you could add many GB toward the model size for only a 1-2% improvement in accuracy. Also, the size of the model can bog you down and take days of processing for the solution to one problem. So, you trade a bit of accuracy in order to save some time. Various websites provide models with information about accuracy/size tradeoff, and you can use these to make an informed choice.
Users of AI solutions have different needs, and the highest-performing algorithm may not always be the best choice. For example, a fast algorithm may be sub-optimal, or it may be optimal but hard to use. Or it may favor high recall, but you’re interested in high precision. Pick the option that best suits your needs.
Module 2 FAQ
Rule-based and knowledge-based methods are in the First Wave of AI. Logical reasoning is associated with First Wave techniques, but the cumbersome nature of hand-built rules made it difficult to progress beyond small-scale/domain-specific applications. The First Wave of AI provided the foundation for the Third Wave (explainable AI systems), but did not quite progress far enough to achieve this before the emergence of the Statistical Revolution (the Second Wave).
Module 3 FAQ
Your representation determines the answer to this question. If you represent height as a numeric value, two heights can be added together or their difference can be compared. They can even be divided or multiplied (as in ratio data values) in scenarios such as “John is half the height of his pet elephant.” However, for the purposes of this AI course, “height” is represented as “short,” “medium,” and “tall,” so it cannot be taken as an interval or ratio attribute. To learn more, see Module 3.1.
Outliers are data that have properties differing significantly from most of the other data in the set. Thus, they aren’t a priori an indicator of a data quality issue. If discovered, outliers require further investigation to confirm that they are truly outliers and not errors (i.e., noise). If they are determined to be noise, they pose a data quality issue that requires resolution. By contrast, true outliers are often indicators of “interesting” cases, even in very high quality data. Training an algorithm to detect anomalies and find these “interesting cases” (e.g., fraudulent or uncommon activities) may require some form of outlier filtering to generate a training set. To learn more, see Module 3.4.
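As an illustration of the "discover, then investigate" step, the sketch below flags candidate outliers using the median absolute deviation (MAD). This is one common rule of thumb and is an assumption here, not a method prescribed by the course. The sample values, the function name, and the threshold of 5 are made up. Flagged values are only candidates: they still need the follow-up check that separates true outliers from noise.

```python
import statistics


def flag_outlier_candidates(values, threshold=5.0):
    """Return values whose distance from the median exceeds threshold * MAD."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread to measure against
    return [v for v in values if abs(v - med) / mad > threshold]


readings = [10, 11, 9, 10, 12, 11, 10, 48]  # made-up sample values
candidates = flag_outlier_candidates(readings)
print(candidates)
```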
Noise includes an element of error: errors introduced by a human, or by faulty algorithms or sensors, at the time the data is created or collected. By contrast, outliers are valid data points that fall outside of the typical data that is expected for a particular dataset.
Consider, for example, a dataset of all the words one person recorded via voice recognition software over a period of a year in which there is only a single occurrence of the word onomatopoeia. This word is valid but not within the typical or expected vocabulary of the person who uttered it; thus, it is considered an “outlier.” However, if the voice recognition software misrecognizes the word bare for the word bear, and this output will later be used for training an NLP system, this misrecognition error is considered “noise.”
Indeed, there are two notions of time. The first is a time stamp, such as a date (March 3, 2020) or a time of day (1pm). These are interval notions in that they can be ordered and they have additive properties (e.g., 1pm is one hour before 2pm). The second is a time duration (2-hour meeting, 3-week vacation). These are ratio notions. Note also that Centigrade/Fahrenheit are interval notions and Kelvin is a ratio notion (due to a universally meaningful notion of 0). The lecture summary slides clarify these definitions.
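Python's standard library happens to mirror this distinction, which makes for a quick illustration. In the sketch below, time stamps are `datetime` values and durations are `timedelta` values; the specific dates and the variable names are just examples.

```python
from datetime import datetime, timedelta

start = datetime(2020, 3, 3, 13, 0)  # time stamp: March 3, 2020, 1pm
end = datetime(2020, 3, 3, 14, 0)    # time stamp: March 3, 2020, 2pm

gap = end - start                    # differences between time stamps are meaningful
meeting = timedelta(hours=2)         # a duration
ratio = meeting / gap                # durations have a true zero, so ratios make sense

try:
    start + end                      # adding two time stamps has no meaning
    stamps_add = True
except TypeError:
    stamps_add = False

print(gap, ratio, stamps_add)
```

Subtracting two time stamps gives a duration, and dividing one duration by another gives a plain number ("the meeting is twice as long as the gap"). Adding two time stamps is rejected, consistent with time stamps being interval rather than ratio values.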
No. In fact, we have moved a step forward in robustness by employing statistical methods (the Second Wave revolution), but we have taken a step backward in our ability to explain algorithmic outcomes/behaviors as we would have been able to if rules were employed (as they were in the First Wave). In the Third Wave, which is still underway and evolving, the benefits of both the First Wave and the Second Wave are leveraged.
Module 4 FAQ
A null hypothesis (H0) is a hypothesis that says there is no statistically significant relationship between two variables in the hypothesis, and any appearance of such a relationship is by chance. For example, imagine a person wants to study the impact of two different types of water, distilled and undistilled, on plant growth. In this example, the two variables are “type of water” and “plant growth.” What’s the null hypothesis? The null hypothesis would be something like, “There is no statistically significant relationship between the type of water fed to the plants and the growth of the plants.” An alternative hypothesis (H1) is the idea that there is such a relationship, for example: “Distilled water promotes plant growth (in comparison to undistilled water)” or “Undistilled water promotes plant growth (in comparison to distilled water).” Note that there is only one possible null hypothesis, but there may be more than one alternative hypothesis. (See also Understanding Null Hypothesis in your supplemental readings.)
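One way to see what "by chance" means is a permutation test, sketched below for the water example. The growth measurements are invented for illustration, and the choice of a permutation test (rather than, say, a t-test) is an assumption. If the type of water made no difference, the labels “distilled” and “undistilled” would be interchangeable, so the sketch tries every way of relabeling the plants and checks how often the gap in average growth is at least as large as the one observed.

```python
from itertools import combinations

# Made-up growth measurements (cm); illustrative only, not real data.
distilled = [20.1, 21.3, 19.8, 22.0, 20.7]
undistilled = [17.9, 18.4, 19.0, 17.5, 18.8]


def mean_gap(group_a, group_b):
    """Difference between the two group averages."""
    return sum(group_a) / len(group_a) - sum(group_b) / len(group_b)


def permutation_p_value(a, b):
    """Share of relabelings whose mean gap is at least as extreme as the observed one."""
    pooled = a + b
    observed = abs(mean_gap(a, b))
    extreme = total = 0
    for idx in combinations(range(len(pooled)), len(a)):
        chosen = set(idx)
        group_a = [pooled[i] for i in idx]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance so floating-point rounding does not hide exact ties.
        if abs(mean_gap(group_a, group_b)) >= observed - 1e-9:
            extreme += 1
    return extreme / total


p_value = permutation_p_value(distilled, undistilled)
print(round(p_value, 4))
```

A small value means the observed gap would rarely arise from relabeling alone, which counts as evidence against the null hypothesis. It does not say which alternative hypothesis is true; the direction of the gap does that.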
Module 5 FAQ
Word embedding is the process of assigning a word a vector value that correlates with the meaning of the word. This vector is a numerical representation of the text that allows computers to handle these meanings.
ML and deep learning architectures are incapable of processing strings of plain text in their raw form. The numeric values allow the computer to process the meaning and perform actions such as classification and regression on the vectors. The word embedding format maps a word using a “dictionary to vector” method. This is typically achieved through specific algorithms for calculating embeddings, including word2vec, fastText, ELMo, etc. These algorithms define the values of words based on the semantic meaning of the word. Therefore, words that are used in similar contexts will be given similar representations (i.e., they will be placed close together within the high-dimensional semantic space). These points will cluster together, and their distance to each other will be low.
These algorithms can be divided into two subcategories which define how the words are classified: frequency-based embedding (such as count vector and co-occurrence vector) and prediction-based embedding (which leverages neural networks, such as continuous bag of words). It’s important to note that in most representations, there is no explicit connection between a numerical value in the vector and any sort of interpretable meaning. Thus, the vector itself does not have an inherent meaning. It captures meaning solely through its distance to the vectors of other words.
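The frequency-based idea can be sketched in a few lines. The example below builds co-occurrence vectors from a tiny invented corpus and compares them with cosine similarity. The corpus, the window size of one word, and the function names are all assumptions for illustration. Real systems train on far more text and typically use the prediction-based algorithms named above.

```python
import math
from collections import Counter, defaultdict

# Tiny made-up corpus; real embeddings are trained on far more text.
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the bank lends money",
]


def cooccurrence_vectors(sentences, window=1):
    """Map each word to counts of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


vectors = cooccurrence_vectors(corpus)
cat_dog = cosine(vectors["cat"], vectors["dog"])
cat_bank = cosine(vectors["cat"], vectors["bank"])
print(round(cat_dog, 2), round(cat_bank, 2))
```

Here “cat” and “dog” share the same neighbors, so their vectors point in nearly the same direction, while “cat” and “bank” overlap only on “the.” No single number in either vector means anything on its own; only the comparison between vectors is informative.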
Programs like word2vec also allow users to train their own word vectors. The training input is formatted as a list with one entry per document, where each entry is itself a list of that document’s tokens. This type of custom training requires extensive preprocessing.
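The nested-list input format looks like this in practice. The sketch below covers only the preprocessing step, using two invented documents and a deliberately simple tokenizer (lowercasing plus a regular expression); both are assumptions, and real pipelines usually do considerably more cleanup. A list of token lists of this shape is what a word2vec implementation such as the one in the gensim library is commonly given as its training sentences.

```python
import re

# Two made-up documents standing in for a real corpus.
documents = [
    "The mechanic picked up a wrench.",
    "She won the 100m sprint!",
]


def tokenize(text):
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


# One inner list of tokens per document.
corpus = [tokenize(doc) for doc in documents]
print(corpus)
```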
In summary: no, the vector values of words are not assigned at random. The values come from the algorithm that processed the word and that value is dependent upon the training of the algorithm. The processing algorithm could technically be trained to map each character of the word, but typically the whole word or even a word phrase is processed together to collect an overall semantic meaning.
When only the predicted word is being considered, the probability value comes from the chosen word domain. The probability value is the likelihood of that word appearing in the text according to the chosen word domain. For example, the likelihood of the word wrench appearing in text about a mechanics shop is higher than that of a word like sprint, but the opposite would be true if the domain of sports were of interest. Note that unigram counts are generally useful in conjunction with n-gram probabilities to improve the fit of the model to vocabulary specific to the domain of interest.
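A minimal sketch of how such domain-dependent probabilities could be estimated is shown below. It assumes the simplest possible estimate, the relative frequency of a word within a domain’s text, and uses two invented one-line “corpora.” The function name and the texts are hypothetical.

```python
from collections import Counter

# Made-up miniature "domains"; real estimates need large corpora.
domain_texts = {
    "mechanics": "hand me the wrench the wrench is under the car hood",
    "sports": "she won the sprint and the sprint record fell",
}


def unigram_probability(word, text):
    """Relative frequency of `word` among the tokens of `text`."""
    counts = Counter(text.split())
    return counts[word] / sum(counts.values())


for domain, text in domain_texts.items():
    print(domain,
          "wrench:", round(unigram_probability("wrench", text), 3),
          "sprint:", round(unigram_probability("sprint", text), 3))
```

The same word receives a different probability depending on which domain text the counts come from, which is the point of choosing the word domain to match the task.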
Semantic parsers are difficult to apply to tasks because annotating the data for them is time consuming and people do not agree on how to represent concepts in a way that enables generalization. However, semantic parsers capture subtle distinctions that would not otherwise be captured (e.g., by syntactic parsers). Thus, it is worth exploring ways to circumvent such issues.
Accurate semantic parsers need custom rules fashioned in a way that represents how the data was annotated, but these hand-built rules may also incorporate language model probabilities or apply graph-based probabilistic methods for determining the correct structure. Such hybrid approaches benefit from the “best of both worlds.”
There are some methods that are currently used to improve semantic parsers and find general agreement on the underlying representations of concepts. One is by verifying that the semantic parser captures all of the relevant meaning for a wide variety of sentences, usually determined by a group of human judges. Another is verifying that the output structure of a semantic parser leads to valid entailments of sentences in some agreed-upon logical reasoning system. Finally, parsers will often build upon or extend existing resources and representations, even if they are not entirely consistent or complete. The advantage of this approach is that, even without a broad consensus in representation, there will at least be shared resources that can foster collaboration and parallel development of knowledge bases.
Module 6 FAQ
Module 7 FAQ
First, it is important to realize that parsers are generally flawed. In fact, even small tweaks such as punctuation and capitalization can unexpectedly change the structure. Secondly, even among language experts, there is disagreement as to what the appropriate structure might be. For any system development (including for a task like “ask detection”), it is thus critical to make use of more than one tool to extract the critical components for achieving that task, in this case constituency parsing, dependency parsing, and semantic-role labeling.
For example, a subordinate verb such as “want” may be of interest, especially if that verb relates to a category of interest for the task (e.g., GIVE or PERFORM, in the case of ask detection), and different parsers may represent this verb in different ways. However, the logical form (dependency tree) shown for this very complex example (“the amount I want is $100….”) matches what many linguists might consider to be appropriate.
To explain why, let’s reduce the sentence to its simplest form: “The amount is 100 dollars.” This sentence is what is referred to as a “copula” construction, where “is” has only an auxiliary function, rendering the sentence as somewhat of an equality statement: “Amount = $100.” The phrase “I want” is a modifier of “amount” (“the amount that I want…”), so the subject of the sentence is “amount,” and the predicate is the material after the auxiliary “is.” Therefore, this sentence’s predicate is indeed “100 dollars.” Since “100 dollars” is headed by “dollars” with the modifier 100, the term “dollars” (represented as “$”) is the overall predicate and is accordingly positioned at the top of the dependency tree. It turns out this top node is actually what the social engineer wants: the target of the ask. This makes the dependency representation convenient to use for identifying ask targets.
The constituency tree provides a different perspective on the sentence, demoting the dollar ($) to a modifier of the phrase “$100 each.” However, both the dependency tree and the constituency tree highlight “want” as a predicate within the subject of the subsentence “the amount I want,” and this predicate maps to the ask type GIVE. Taken together, the ask type and ask target provide the basis for an ask-detection output of <GIVE[want[$100[finance_money]]]>.
Note that ask detection is a step-by-step process that leverages the constituency and dependency representations to locate the main terms (mostly verbs, like want) that might yield an ask type. Semantic-role labeling further solidifies the choice of ask targets (e.g., “5 gift cards” in the phrase “purchase 5 gift cards”). Confidence scores are assigned to ask-detection outputs based on the strength of evidence from a range of different indicators, including the output of constituency/dependency parsing and semantic-role labeling. It is never just one type of linguistic tool that leads to a high-confidence output. Rather, high confidence is reached only through the application of multiple tools, each of which yields indicators and/or double checks on the output of other tools.
There is a vast difference in the handling of query/phrase-based (structured) questions and open-ended (unstructured) questions by chatbots. For the most part, one can expect structured queries to be accurately answered, especially straightforward “what” questions like “What is the capital of England?” or “What is the special on Tuesdays at Joe’s Diner?”
But even state-of-the-art chatbots have difficulty answering open-ended questions, such as “Would it make sense for me to walk home along Springer Street after 8pm?” or “Why is it more convenient for me to drive to Gainesville from Pensacola than fly?” These are open-ended questions that a chatbot generally cannot answer. Manually coded rules, such as decision trees, are typically used to produce responses, and the space of options is infinite for the types of questions a human might ask. In general, a chatbot cannot answer open-ended questions but can certainly hedge, e.g., “Can you state that a different way?” This distinction boils down to the differences one would expect between the way humans handle questions and the way chatbots handle questions. See for example:
Module 8 FAQ
The primary distinction between GIVE and PERFORM, for the purposes of this module, is the use (or lack thereof) of structural knowledge. The argument of both asks is assigned the label ‘finance_money,’ but the difference is the link that has been included in the second case. “Buy me,” “send me,” “give me”: These are all examples of a GIVE, but the requests don’t provide any instructions. There’s no formal operation, like clicking on an upload link or sending to a particular email address, and so there’s no PERFORM action here.
By contrast, “Contact me at this link,” “Email me at this address,” “Upload your bank statement to this website,” “Pay $1 here”: these are all clear-cut PERFORM operations. PERFORM is generally related to a link that the user will click, or some other type of electronic operation like download/upload. Note: The nature of the structural information might be such that punctuation intervenes between the main part of the sentence and the structural information (i.e., the link to the email address, website, etc.). For example, “To send $1, click here,” is a single PERFORM, with send as the predicate and the argument $1 associated with the index (0) for a link for the word here: <PERFORM[send[$1(0)[‘finance_money’]]]>. The same structure is obtained if the comma is dropped: “To send $1 click here.” If a word like “pay” is used instead of “send,” the structure also remains analogous.
In Module 7, we learned that identification of the arguments of an ask in a spear phishing attack requires parsing to extract the ask and semantic-role labeling to isolate the ask’s arguments (e.g., ARG1). What about named entity recognition? Isn’t that needed for complete identification of ask arguments? For example, a video in Module 8 indicates that an argument is identified as “finance_money” through named entity recognition (NER). Is this only for framing detection, not ask detection? And is there a reason for this distinction?
The core of ask detection is parsing and semantic role labeling. Specifically, identification of the predicate and arguments (using parsing and semantic role labeling) is the most critical task underlying ask detection. However, parsing and semantic role labeling are not sufficient for identifying the type of the argument, which is an auxiliary aspect of ask detection, as introduced in Module 8. Identification of the argument type can be done with a more general named-entity recognizer (PERS, ORG) or with a variant of named-entity recognition that labels arguments with more specific information (e.g., finance_money, personal_DOB). The latter is required for ask detection, and this is currently achieved with an in-house tool that matches arguments to domain-specific categories. However, parsing and semantic role labeling are sufficient for argument identification, which is a first step in ask/framing detection. Argument type identification can be applied in a later step of ask/framing detection (through named-entity recognition) once the arguments are identified.
Module 9 FAQ
Consider the MT output “Mary will run track on Mondays,” and the two references “Mary will run on Mondays” and “on Mondays Mary will run track.” What would the ROUGE score be if we were to ignore uniqueness in the formula?
Let’s try it both ways:
MT output: “Mary will run track on Mondays”
Bigram breakdown in output: (Mary will) (will run) (run track) (track on) (on Mondays)
Total unique bigrams in output: 5
References: “Mary will run on Mondays” and “on Mondays Mary will run track”.
Bigram breakdown in references: first reference: (Mary will) (will run) (run on) (on Mondays); second reference: (on Mondays) (Mondays Mary) (Mary will) (will run) (run track)
Total unique bigrams in references: 6
The number of (unique) reference bigrams in the output is 4: (Mary will), (will run), (run track), and (on Mondays). The output bigram (track on) appears in neither reference. The total number of (unique) reference bigrams is 6. So dividing 4 by 6 yields a ROUGE score of 0.67. However, if we were to ignore the “uniqueness” requirement, the 9 (non-unique) bigrams in the references would yield a ROUGE score of 4/9 = 0.44. This low ROUGE score is unlikely to be an adequate characterization of the high coverage provided by this reasonable MT output, “Mary will run track on Mondays.” Thus, it is critical to take uniqueness into account when counting bigrams for ROUGE.
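Counting bigrams by hand is error-prone, so it can help to recompute the tallies mechanically. The sketch below follows the counting scheme described in this answer (matched unique reference bigrams divided by either the unique or the non-unique reference total). It is an illustration of that scheme with assumed variable names, not a full ROUGE implementation.

```python
def bigrams(sentence):
    """Split a sentence on whitespace and return its adjacent word pairs."""
    tokens = sentence.split()
    return list(zip(tokens, tokens[1:]))


output = "Mary will run track on Mondays"
references = ["Mary will run on Mondays", "on Mondays Mary will run track"]

output_bigrams = set(bigrams(output))
reference_bigrams = [bg for ref in references for bg in bigrams(ref)]  # with repeats
unique_reference_bigrams = set(reference_bigrams)

# Unique reference bigrams that also occur in the MT output.
matched = unique_reference_bigrams & output_bigrams

with_uniqueness = len(matched) / len(unique_reference_bigrams)
without_uniqueness = len(matched) / len(reference_bigrams)

print("matched:", sorted(matched))
print("unique reference bigrams:", len(unique_reference_bigrams))
print("all reference bigrams:", len(reference_bigrams))
print("score with uniqueness:", round(with_uniqueness, 2))
print("score without uniqueness:", round(without_uniqueness, 2))
```

Because the two references repeat several bigrams, the denominator grows when repeats are counted, and the score drops even though the output has not changed.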
Indeed, in many cases, the result will be the same, but in others it is likely the human will bring more knowledge to bear on the form and content of the output, such that semantic equivalence with a reference is likely to be achieved with higher accuracy. For example, TERP may be too generous, e.g., over-matching “canines” to “dogs,” when a human may elect not to do this. There also may be cases where TERP is too stingy, e.g., under-matching “ran after” to “chased” if the system output were “the dogs ran after the cats,” whereas a human may elect to consider “ran after” and “chased” to be semantically equivalent.
In HTER, the human can choose to edit an output based on subtle distinctions that TERP doesn’t know about. For example, TERP’s lexical resources (which may group “canines” and “dogs” together into an equivalence class) may not capture distinctions that a human may know about: most notably that the term “canine” is more general than the term “dog.” Canines include dogs, wolves, foxes, etc. The human may determine, if provided with the context of the sentence (which is generally available in the human’s editing environment) that “dogs” is more appropriate than “canines.” TERP will not have any knowledge about the range of species included in the canine family, nor will it have any extra-sentential context or other common sense knowledge, e.g., that “cats” are probably better paired with “dogs” than with “canines.”
In HTER, the human makes the minimal number of changes to create an output that is semantically equivalent to the ground truth. The result may look quite different from both the MT output and the human reference on the surface, e.g., “the canines chased the felines,” if that conveys the meaning at an appropriate level of abstraction. Once the human is done editing, the edits are automatically counted up for application of the TER formula.
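The automatic counting step can be sketched as a word-level edit distance divided by the number of words in the edited version. Note the simplification: TER also treats moving a block of words as a single edit, and this sketch leaves shifts out, counting only insertions, deletions, and substitutions. The post-edited sentence and the function name are hypothetical.

```python
def word_edit_distance(hypothesis, reference):
    """Minimum number of word insertions, deletions, and substitutions."""
    hyp, ref = hypothesis.split(), reference.split()
    previous = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        current = [i]
        for j, r in enumerate(ref, start=1):
            current.append(min(
                previous[j] + 1,             # delete h from the hypothesis
                current[j - 1] + 1,          # insert r into the hypothesis
                previous[j - 1] + (h != r),  # keep a match or substitute
            ))
        previous = current
    return previous[-1]


mt_output = "the dogs ran after the cats"
post_edited = "the dogs chased the cats"  # hypothetical human-edited version

edits = word_edit_distance(mt_output, post_edited)
edit_rate = edits / len(post_edited.split())
print(edits, edit_rate)
```

Here one substitution (“ran” to “chased”) and one deletion (“after”) turn the output into the edited version, so a lower rate indicates an output that needed less fixing.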
Note: In general, human-in-the-loop evaluation provides a higher-quality evaluation than automated approaches, but it is likely to be a lot more expensive. Such tradeoffs need to be taken into account with respect to the goals and resources of the AI algorithm developer.
In that diagram, the column of samples on the left are all the truly positive cases (which is why this column is marked “Actual Class Label = 1”) and the column of samples on the right are all the truly negative cases (which is why this column is marked “Actual Class Label = 0”). Imagine you run your algorithm on Sunday to predict Monday’s weather in a whole pile of locations, and then on Tuesday you compare your algorithm’s output to the actual answers (now that Monday’s weather is known for all those locations). Those actual answers are the ground truth, i.e., the class labels 1=sunny (filled circles) and 0=not sunny (empty circles). So how did your algorithm do? Perhaps your algorithm was able to predict some of those 1=sunny cases. Those are the true positives, i.e., filled circles in the green half of the circle. Unfortunately, it also predicted 1=sunny for some of the cases that are marked as 0=not sunny in the ground truth. Those are false positives, i.e., the empty circles in the pink half of the circle.
In addition, the algorithm may not be able to predict 1=sunny for several cases, i.e., the false negatives in the left column outside of the circle. Happily, the true positives inside the green half of the circle and the true negatives in the right column outside the circle are good news: the algorithm matches the ground truth in those cases. But the false positives will hurt precision and the false negatives will hurt recall. The goal is to keep false positives and false negatives as low as possible, but algorithms usually favor reduction of one over the other and may not be able to reduce both.
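The four outcomes and the two measures they feed can be tallied directly. In the sketch below, the eight ground-truth and predicted labels are made up for illustration (1 = sunny, 0 = not sunny), and the variable names are assumptions.

```python
# Made-up labels for eight locations: 1 = sunny, 0 = not sunny.
actual = [1, 1, 1, 1, 0, 0, 0, 0]     # ground truth, known on Tuesday
predicted = [1, 1, 1, 0, 1, 1, 0, 0]  # what the algorithm said on Sunday

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # predicted sunny, was sunny
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # predicted sunny, was not
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # missed a sunny day
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # predicted not sunny, was not

precision = tp / (tp + fp)  # hurt by false positives
recall = tp / (tp + fn)     # hurt by false negatives
print(tp, fp, fn, tn, precision, recall)
```

With these labels the two false positives pull precision down more than the single false negative pulls recall down, which illustrates how an algorithm can favor one measure over the other.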
Yes. Refer to the table provided at the end of the supplemental readings for Module 9.